How to prevent segmentation fault of concurrent jobs in Bazel?

I wrote a rule to run some compiler (Synopsys VCS MX). When running a single target, everything works great. When running multiple targets concurrently, the compiler runs into a segmentation fault. This doesn't happen when running Bazel with --spawn_strategy=local. Also setting --jobs 1 works.
The only reason for this that I can think of is that the compiler tries to write to a file with an absolute path, colliding with other instances of itself.
My questions are as follows:
If my theory were correct, wouldn't the problem occur regardless of whether I'm sandboxing or not?
If I'm wrong, how could the compilers be colliding if not because of some shared file?
Say that for every sandbox I wanted to mount a /tmp that points to a different directory; would that be possible?
Update:
According to what I saw in strace, both instances of the compiler open a file /tmp/vcs_20200428163636_3/v710_tok for reading and writing, and at some point one instance calls pread64(), which causes the segfault. Notice the file's name, which contains what looks like a timestamp, hinting that there was an attempt to get a unique file name; the two instances just weren't started far enough apart.
Questions 1 and 3 still stand.

The solution:
By adding --sandbox_tmpfs_path=/tmp the problem was solved. This tells Bazel that, when creating a sandbox for an action, it should mount an empty writable tmpfs at the path /tmp. This way each compiler gets its own /tmp, and they don't collide.
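For example (the target label is a placeholder), the flag can be passed on the command line:

$ bazel build //my:target --sandbox_tmpfs_path=/tmp

or made permanent by adding the line build --sandbox_tmpfs_path=/tmp to your .bazelrc.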
Why does the collision happen only when sandboxing?
When executing run_shell in a sandbox, Bazel launches the shell using clone, which causes it to run in a new PID namespace. The compiler's PID (3 in this case, as can be seen in /tmp/vcs_20200428163636_3/v710_tok) is added to the name of the file it opens in /tmp, in an attempt to make the file name unique. However, because the compilers are forked inside their separate sandbox PID namespaces, each one sees its PID relative to its own sandbox, so both can end up with the same "unique" PID and collide.
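You can reproduce the effect outside of Bazel with unshare from util-linux (requires root or user-namespace support). Each invocation forks its shell as PID 1 of a fresh PID namespace, so two concurrent runs report the same PID:

$ sudo unshare --pid --fork sh -c 'echo $$'
1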

Related

:erlang.load_nif/2 finds shared library file inside original project but can't find it if the project gets imported

I've built a small Elixir application that uses NIF functions to execute some C++ code.
The nifs are loaded via:
def load_nifs do
  :erlang.load_nif('<relative_path_to_lib>/<lib_name>', 0)
  :ok
end
and this works fine.
Now I want to integrate this app into another project. The problem now is that load_nif throws:
Failed to load NIF library: '<relative_path_to_lib>/<lib_name>.so: cannot open shared object file: No such file or directory'
although nothing has changed. I checked the deps folder, and the shared library files are exactly where they are supposed to be, so the dependency seems to be loaded correctly. I also tried putting the .so files into the same folder as the module that calls load_nif (omitting <relative_path_to_lib>/), as well as providing an absolute path, all to no avail.
Any help is appreciated, Cheers.
Relevant info regarding my system:
OS: Ubuntu 22.04
Elixir version: Elixir 1.13.0 (compiled with Erlang/OTP 24)
Update:
The issue does not seem to be that files are located at the wrong place, as it finds the files during the first test run after compilation.
However, the error occurs when I repeat the run. The error message seems misleading, since no files are deleted during the test.
If I call the function multiple times within one test there's no problem, so the issue is triggered not by the NIF function being executed multiple times, but by the test that contains it being repeated.
Solution:
I still have no idea what causes this behavior, but after putting the .so files into a priv directory and accessing them via
:erlang.load_nif(:code.priv_dir(:<app_name>) ++ '/<lib_name>', 0)
the tests pass.
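For reference, a minimal sketch of that pattern as it is conventionally written (the module, app, and library names, and the stub function, are placeholders):

defmodule MyApp.Native do
  @on_load :load_nifs

  def load_nifs do
    # :code.priv_dir/1 returns the compiled app's priv directory as a charlist;
    # append the library name without the .so extension.
    :my_app
    |> :code.priv_dir()
    |> :filename.join('lib_name')
    |> :erlang.load_nif(0)
  end

  # Stub that is replaced by the native implementation once the NIF loads.
  def my_nif_function(_arg), do: :erlang.nif_error(:nif_not_loaded)
end

Loading from priv_dir also keeps the path stable when the app is used as a dependency, since the path is resolved relative to the compiled app rather than the caller's working directory.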

Optional file dependencies in Bazel?

Is there a way to specify optional dependencies in Bazel?
I'd like to make a rule to somewhat mirror Kitware's ExternalData, but I would like to see if I can enable workflows where the developer edits the file in-tree, ideally without needing to modify the BUILD file.
Ideal Workflow
Define a rule, external_data, which can fetch a file from a given server given its SHA-512.
If the file already exists, check its SHA-512.
If it matches the requested hash, symlink / copy this file (ensuring that no tests can modify the original file).
If it is different, print a warning but proceed as normal, to allow developers to quickly modify the large files as they need.
I would like to do this such that Bazel can switch between the file being present and not, and be robust to false-positives on caching. An example scenario that I would like to avoid, if I were to not include it as an optional dependency:
In a prior run, the file was in the workspace, Bazel built the target, everything's fine and dandy.
Developer removes the file from the workspace after uploading, satisfied with their changes and wanting to test the download process.
When running the downstream target, Bazel doesn't care about the change in the workspace since it's not an explicit dependency; the symlink is invalidated, and the test crashes and burns.
To me, it seems like I'd run into this if I tried to implement a repository_rule which manually checks for the file's existence and conditionally executes (I'm not sure whether analysis would retrigger this rule's evaluation if Step 2 happens).
Workaround
My current thought for an alternative workflow is to have an explicit option for external_data, use_workspace: if False, it will download the file; if True, it will just mirror exports_files([]). The developer can then set this when modifying files.
(Ideally, I'd like to optionally include a file which indicates the SHA (${file}.sha512), but this seems to go back to the original ask.)
One workaround is to use Bazel's glob(...) method to effectively check for file existence.
If you have a file, say basic.bin.sha512, and you want a rule to switch modes based on that file's existence, you can use glob(["basic.bin.sha512"]), which will either match the package file exactly or return an empty list.
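A sketch of what that switch might look like in a BUILD file (external_data and its use_workspace attribute are the hypothetical rule and option described above):

# glob() returns ["basic.bin.sha512"] if the file is checked in, else [].
sha_files = glob(["basic.bin.sha512"])

external_data(
    name = "basic_bin",
    # Hypothetical wiring: download when the SHA file is present,
    # otherwise use the file checked into the workspace.
    use_workspace = len(sha_files) == 0,
)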
I had tinkered around with using this on larger sets of files, and it appears to work. However, for the time being, I've erred toward having a sort-of explicit "development" mode for the target definition, to keep the Bazel build relatively consistent regardless of what files may be checked out.
Here's an example usage:
https://github.com/EricCousineau-TRI/external_data_bazel/blob/4bf1dff/WORKFLOWS.md#edit-files-in-a-sha512-group

Cobol JCL error IEW

Can anyone help me with this JCL error? It is the only thing stopping my code from compiling and running. It keeps giving back a max CC of 12 even though the code is running perfectly.
IEW2736S THERE IS NO SPACE LEFT IN THE DIRECTORY FOR DDNAME SYSLMOD. STOW OF THE
DIRECTORY ENTRY MEMBER NAME STRBRK FAILED.
I haven't been able to find a fix for it anywhere, so I feel this is my last hope.
As the message says, the directory for the library on SYSLMOD is full, so your (new) member cannot be added.
If it is a library which you defined yourself: make a copy of the data, ensure the copy worked, then delete and redefine the library with more space for the directory, and finally copy the backed-up data into your newly defined library.
If you are uncertain on how to do these, seek advice from your colleagues/technical support.
If it is not a library that you defined, find out who is responsible for it, and ask that it be extended.
If there are members on the library which you put there and which you no longer need, you can delete (at least one) and continue as a short-term thing.
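If you do redefine the library yourself, note that the directory size is the third subparameter of SPACE. A sketch of an allocation step (the dataset name and sizes are illustrative, not a definitive layout):

//* ALLOCATE A LOAD LIBRARY WITH 100 DIRECTORY BLOCKS (3RD SPACE SUBPARM)
//REDEFINE EXEC PGM=IEFBR14
//NEWLIB   DD  DSN=MY.LOAD.LIBRARY,DISP=(NEW,CATLG),
//             UNIT=SYSDA,SPACE=(CYL,(10,5,100)),
//             DCB=(RECFM=U,BLKSIZE=32760)

Alternatively, allocating the library as a PDSE (DSNTYPE=LIBRARY) avoids the fixed-size directory altogether.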

Can WinDBG be made to find mscordacwks.dll in the symbol store?

The Question
There are plenty of manual ways to make WinDBG find mscordacwks.dll without a symbol store (putting the file in the path somewhere, putting it in the same folder as windbg.exe, putting it in my Framework\v folder, specifying the path in WinDBG using .cordll -lp c:\dacFolder, etc.), but they all only fix it for me. I need to fix it more generally for everyone who uses my symbol store.
The possible solutions I can imagine are:
WinDBG could be made to check the symbol store using mscordacwks.dll's subfolder name instead of mscorwks.dll's subfolder name.
SymStore.exe could be made to add mscordacwks.dll under mscorwks.dll's subfolder name, so WinDBG finds it when it looks there.
Q: Are either of these things possible, or is there another way that I'm not thinking of to solve the problem?
The Background
When analyzing a .NET process, I encountered the (apparently common) problem that psscor2 (and sosex) could not find the appropriate mscordacwks.dll on my machine. The error in WinDBG is:
Failed to load data access DLL, 0x80004005
Verify that 1) you have a recent build of the debugger (6.2.14 or newer)
2) the file mscordacwks.dll that matches your version of mscorwks.dll is
in the version directory
3) or, if you are debugging a dump file, verify that the file
mscordacwks_<arch>_<arch>_<version>.dll is on your symbol path.
4) you are debugging on the same architecture as the dump file.
For example, an IA64 dump file must be debugged on an IA64
machine.
You can also run the debugger command .cordll to control the debugger's
load of mscordacwks.dll. .cordll -ve -u -l will do a verbose reload.
If that succeeds, the SOS command should work on retry.
If you are debugging a minidump, you need to make sure that your executable
path is pointing to mscorwks.dll as well.
There are plenty of SO questions on this and plenty of good answers, practically all of which ultimately reference Doug Stewart's outstanding blog post, What is mscordacwks.dll?.
Thanks to that, I got my situation all straightened out by obtaining the correct mscordacwks.dll and placing it here:
"C:\Symbols\mscordacwks_AMD64_AMD64_2.0.50727.4216.dll\4E1545829a3000\mscordacwks_AMD64_AMD64_2.0.50727.4216.dll"
where I knew WinDBG would look because I had previously tried it with !sym noisy.
So I'm all set now, but I had to put it in that path physically rather than adding it to my symbol server through the normal symstore.exe mechanism. Since my symbol store is used by more than just me, I need to do it the right way for everyone else using the store.
And that's the problem. When I add the file using symstore.exe, instead of going into the above path it goes into:
"C:\Symbols\mscordacwks_AMD64_AMD64_2.0.50727.4216.dll\4E1545CB1bd000\mscordacwks_AMD64_AMD64_2.0.50727.4216.dll"
The only difference being that the subfolder name is 4E1545CB1bd000 here instead of the 4E1545829a3000 that WinDBG is looking for.
The reason for this is that when adding a binary to the symbol store, symstore.exe uses the binary's PE header to get the image timestamp and the image size. In the case of this particular DLL, dumpbin.exe /headers mscordacwks.dll reveals these to be:
image timestamp: 0x4E1545CB (Thu Jul 07 01:36:11 2011)
image size: 0x1BD000
Hence, the subfolder name 4E1545CB1BD000.
What WinDBG is looking for, on the other hand, is a subfolder based on the image timestamp and image size of mscorwks.dll, not mscordacwks.dll, because the former is loaded into the process, not the latter. WinDBG can't know the timestamp and size of the DAC module because that module is not in the process dump.
As further verification of this explanation, dumpbin.exe /headers mscorwks.dll reveals:
image timestamp: 0x4E154582 (Thu Jul 07 01:34:58 2011)
image size: 0x9A3000
which, concatenated, give the subfolder name 4E1545829A3000.
Knowing this, it now makes a lot more sense why the many versions of mscordacwks.dll that people keep running into seem to be missing from Microsoft's symbol servers. I'm sure they're there; it's just that WinDBG and psscor2 can't find them because they're computing the wrong subfolder name. Why it even bothers searching the symbol path is beyond me, since it's guaranteed never to find the file there!
So that's my challenge. Can I somehow force symstore.exe to add mscordacwks.dll using the PE info of mscorwks.dll? If not, am I missing something about WinDBG and psscor2? Might there be a way for them to know the correct timestamp and size of mscordacwks.dll even though it's not loaded (and a way for them to actually use those instead of mscorwks.dll's)?
Since no other solution has appeared and my workaround handles everything nicely, I'm just going to keep on with it, and I would recommend the same to anyone else who never wants to see the annoying Failed to load data access DLL, 0x80004005 error again.
To make this work for you and everyone who uses your symbol store (I really wish Microsoft would do this to save us all a lot of trouble), simply place the compressed DAC file (mscordacwks.dl_) by hand into the correct path within your symbol store.
Here are the steps I follow to accomplish this:
In WinDBG do a !sym noisy
In WinDBG do a .cordll -ve -u -l
In WinDBG do another !CLRStack or other psscor2 command if necessary to force it to load symbols again
The symbol search logging will reveal the dll it’s looking for and where it’s looking in your symbol store by showing lines like this: C:\Symbols\mscordacwks_AMD64_AMD64_2.0.50727.4216.dll\4E1545829a3000\mscordacwks_AMD64_AMD64_2.0.50727.4216.dll which indicates two things:
you need the 64-bit mscordacwks.dll of version 2.0.50727.4216; see https://stackoverflow.com/a/12024171/1910619 for ways to get it
it needs to go in a subfolder called 4E1545829a3000 under a folder called mscordacwks_AMD64_AMD64_2.0.50727.4216.dll in your symbol store
Once you obtain the file, rename it according to the name WinDBG is looking for, e.g. "mscordacwks_AMD64_AMD64_2.0.50727.4216.dll"
Manually compress this file using makecab.exe like this: makecab /D CompressionType=LZX /D CompressionMemory=21 mscordacwks_AMD64_AMD64_2.0.50727.4216.dll mscordacwks_AMD64_AMD64_2.0.50727.4216.dl_
Copy that compressed file to the expected place in the symbol store. (That you found in step 4 above, so C:\Symbols\mscordacwks_AMD64_AMD64_2.0.50727.4216.dll\4E1545829a3000\mscordacwks_AMD64_AMD64_2.0.50727.4216.dl_ in our running example here.)
You have to put the CLR dll and the associated mscordacwks dll in the same folder and register the CLR dll with symstore.
In that case symstore will add both the CLR and the mscordacwks dll to the symbol store.
More importantly, it will use the timestamp and file size of the clr dll to create the mscordacwks subfolder, so windbg can find the mscordacwks dll when debugging a dump.
The mscordacwks name must follow the mscordacwks_ARCH_ARCH_fileversion pattern, otherwise symstore won't add it to the symbol store.
I didn't find any documentation on this feature, so it may be removed in the future.
Here is the command and the symstore output:
symstore.exe add /o /f 4.6.1076.00\clr.dll /t clr.dll /s \\mystore\microsoft
SYMSTORE MESSAGE: 0 alternate indexers registered
SYMSTORE MESSAGE: LastId.txt reported id 0
SYMSTORE MESSAGE: History.txt reported id 58228
SYMSTORE MESSAGE: Final id is 0000058228
SYMSTORE MESSAGE: Copying C:\Users\build.robot\symstore\4.6.1076.00\clr.dll to \\mystore\microsoft\clr.dll\56D79ED4990000\clr.dll [Force: T, Compress: F]
SYMSTORE MESSAGE: Copying 4.6.1076.00\mscordacwks_AMD64_AMD64_4.6.1076.00.dll to \\mystore\microsoft\mscordacwks_AMD64_AMD64_4.6.1076.00.dll\56D79ED4990000\mscordacwks_AMD64_AMD64_4.6.1076.00.dll [Force: T, Compress: F]
SYMSTORE: Number of files stored = 2
SYMSTORE: Number of errors = 0
SYMSTORE: Number of files ignored = 0

keep rsync from removing unfinished source files

I have two machines, speed and mass. speed has a fast Internet connection and is running a crawler which downloads a lot of files to disk. mass has a lot of disk space. I want to move the files from speed to mass after they're done downloading. Ideally, I'd just run:
$ rsync --remove-source-files speed:/var/crawldir .
but I worry that rsync will unlink a source file that hasn't finished downloading yet. (I looked at the source code and I didn't see anything protecting against this.) Any suggestions?
It seems to me the problem is transferring a file before it's complete, not deleting it afterwards.
On Linux, a file can be open in process A while process B unlinks it; there's no error, but of course A is wasting its time. So the fact that rsync deletes the source file is not, in itself, a problem, because rsync only deletes the source after it has been copied.
The real problem is the copy: if the file is still being written to disk when rsync transfers it, you'll end up with a partial file at the destination.
How about this: mount mass as a remote file system (NFS would work) on speed, and have the crawler write the files there directly.
How much control do you have over the download process? If you roll your own, you can have the file download into a temp directory or under a temporary name, and then mv it to the correct name once it's finished. If you're using third-party software, you have less control, but you still might be able to do the temp-directory trick.
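If you do control the download step, a sketch of that idea (the fetch command and paths are illustrative; both directories should be on the same filesystem so the mv is an atomic rename):

# Download to a staging directory, then rename into the crawl directory when complete.
curl -o /var/crawltmp/"$name" "$url" && \
    mv /var/crawltmp/"$name" /var/crawldir/"$name"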
Rsync can exclude files matching certain patterns. Even if you can't modify the crawler to download files into a temporary directory, it may have a convention of naming files differently while they are being downloaded (for example: foo.downloading for a file named foo), and you can use this property to exclude files which are still being downloaded from being copied.
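Assuming the crawler does mark in-progress files with a .downloading suffix, the transfer from the question becomes:

$ rsync --remove-source-files --exclude='*.downloading' speed:/var/crawldir .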
If you have control over the crawling process, or it has predictable output, the above solutions (storing in a temp file until finished and then mv'ing it to the completed-downloads place, or ignoring files with a '.downloading'-style name) should work. If all of that is beyond your control, you can make sure that no process has the file open by running lsof $filename and checking whether there is any output; clearly, if no one has the file open, it's safe to move it over.
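A sketch of that check (lsof exits with status 0 only when some process has the file open):

# Move the file only if no process currently has it open.
if ! lsof "$filename" > /dev/null 2>&1; then
    mv "$filename" /path/to/mass/
fi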
