Uploading Program to OpenMPI gives initialization error, on IntelMPI memory leak - memory

I am a graduate student (master's) and use an in-house code for running my simulations that use MPI. Earlier, I used OpenMPI on a supercomputer we used to access and since it shut down I've been trying to switch to another supercomputer that has Intel MPI installed on it. The problem is, the same code that was working perfectly fine earlier now gives memory leaks after a set number of iterations (time steps). Since the code is relatively large and my knowledge of MPI is very basic, it is proving very difficult to debug it.
So I installed OpenMPI onto this new supercomputer I am using, but it gives the following error message upon execution and then terminates:
Invalid number of PE
Please check partitioning pattern or number of PE
NOTE: The error message is repeated for as many numbers of nodes I used to run the case (here, 8). Compiled using mpif90 with -fopenmp for thread parallelisation.
There is in fact no guarantee that running it on OpenMPI won't give the memory leak, but it is worth a shot I feel, as it was running perfectly fine earlier.
PS: On Intel MPI, this is the error I got (compiled with mpiifort with -qopenmp)
Abort(941211497) on node 16 (rank 16 in comm 0): Fatal error in PMPI_Isend: >Unknown error class, error stack:
PMPI_Isend(152)...........: MPI_Isend(buf=0x2aba1cbc8060, count=4900, dtype=0x4c000829, dest=20, tag=0, MPI_COMM_WORLD, request=0x7ffec8586e5c) failed
MPID_Isend(662)...........:
MPID_isend_unsafe(282)....:
MPIDI_OFI_send_normal(305): failure occurred while allocating memory for a request object
Abort(203013993) on node 17 (rank 17 in comm 0): Fatal error in PMPI_Isend: >Unknown error class, error stack:
PMPI_Isend(152)...........: MPI_Isend(buf=0x2b38c479c060, count=4900, dtype=0x4c000829, dest=21, tag=0, MPI_COMM_WORLD, request=0x7fffc20097dc) failed
MPID_Isend(662)...........:
MPID_isend_unsafe(282)....:
MPIDI_OFI_send_normal(305): failure occurred while allocating memory for a request object
[mpiexec#cx0321.obcx] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:357): write error (Bad file descriptor)
[mpiexec#cx0321.obcx] cmd_bcast_root (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:164): error sending cmd 15 to proxy
[mpiexec#cx0321.obcx] send_abort_rank_downstream (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:557): unable to send response downstream
[mpiexec#cx0321.obcx] control_cb (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1576): unable to send abort rank to downstreams
[mpiexec#cx0321.obcx] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[mpiexec#cx0321.obcx] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1962): error waiting for event"
I will be happy to provide the code in case somebody is willing to take a look at it. It is written using Fortran with some of the functions written in C. My research progress has been completely halted due to this problem and nobody at my lab has enough experience with MPI to resolve this.

Related

libfuzzer fuzzing harness crash not reproducible

I want to fuzz an existing harness from stbi harness and make a small change. From free(img) to if(img) free(img);
compile with this command clang -fsanitize=fuzzer,address -ggdb -O0 stbi_read_fuzzer.c -o fuzzer, and run with ./fuzzer corpus -fork=1 -ignore_crashes=1 -dict=jpeg.dict -seed=123
After few hours it produce some crash (global buffer overflow, heap use after free, buffer overflow). But when I run all crash file it didn't crash
aldo#vps:~/stb/tests$ ./fuzzer crash-edab9036233c269e258fe93c2a46d46d5d6e7112
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 2279336272
INFO: Loaded 1 modules (2132 inline 8-bit counters): 2132 [0x61b510, 0x61bd64),
INFO: Loaded 1 PC tables (2132 PCs): 2132 [0x5d0258,0x5d8798),
./fuzzer: Running 1 inputs 1 time(s) each.
Running: crash-edab9036233c269e258fe93c2a46d46d5d6e7112
Executed crash-edab9036233c269e258fe93c2a46d46d5d6e7112 in 3 ms
***
*** NOTE: fuzzing was not performed, you have only
*** executed the target code on a fixed set of inputs.
***
Why it didn't crash?
I'm using ubuntu 20.04 with llvm12 from apt.llvm.org
Old question but I had a similar issue.
In my case I fuzzed a stateful API and forgot to reset the API at the beginning of LLVMFuzzerTestOneInput. So a previous invocation set the API in a invalid state but it didn't crash right away. Only on second invocation the API did crash.
So my guess would be that similarly in your harness some internal state / global variable was changed in a previous invocation. Try to reset everything.
This is because libFuzzer runs in-process and just calls the LLVMFuzzerTestOneInput function as often as possible. The program won't get re-initialized on its own. This is documented:
The fuzzing engine will execute the fuzz target many times with different inputs in the same process.
Ideally, it should not modify any global state (although that’s not strict).

Dataflow concurrency error with ValueState

The Beam 2.1 pipeline uses ValueState in a stateful DoFn. It runs fine with a single worker but when scaling is enabled will fail with "Unable to read value from state" and the root exception below. Any ideas what could cause this?
Caused by: java.util.concurrent.ExecutionException: com.google.cloud.dataflow.worker.KeyTokenInvalidException: Unable to fetch data due to token mismatch for key ��
at com.google.cloud.dataflow.worker.repackaged.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
at com.google.cloud.dataflow.worker.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:459)
at com.google.cloud.dataflow.worker.repackaged.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
at com.google.cloud.dataflow.worker.repackaged.com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:62)
at com.google.cloud.dataflow.worker.WindmillStateReader$WrappedFuture.get(WindmillStateReader.java:309)
at com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillValue.read(WindmillStateInternals.java:384)
... 16 more
Caused by: com.google.cloud.dataflow.worker.KeyTokenInvalidException: Unable to fetch data due to token mismatch for key ��
at com.google.cloud.dataflow.worker.WindmillStateReader.consumeResponse(WindmillStateReader.java:469)
at com.google.cloud.dataflow.worker.WindmillStateReader.startBatchAndBlock(WindmillStateReader.java:411)
at com.google.cloud.dataflow.worker.WindmillStateReader$WrappedFuture.get(WindmillStateReader.java:306)
... 17 more
I believe that exception should just be rethrown. It is thrown by the state mechanism to indicate that additional work on that key should not be performed, and will be automatically retried by the Dataflow runner.
These typically indicate that either that particular work should be performed on a different worker (thus proceeding wouldn't be helpful).
It may be possible that misusing state -- storing the state object from one key and attempting to use it on a different key -- could also lead to these errors. If that is the case, you may be able to see more diagnostic messages in either the worker or shuffler logs in Stackdriver logging.
If neither retrying nor looking at logging and how you use the state objects help, please provide a job ID demonstrating the problem.

Execution was interrupted, reason: EXC_BAD_ACCESS (code=1, address=0xb06b9940)

I'm new to lldb and trying to diagnose an error by using po [$eax class]
The error shown in the UI is:
Thread 1: EXC_BREAKPOINT (code=EXC_i386_BPT, subcode=0x0)
Here is the lldb console including what I entered and what was returned:
(lldb) po [$eax class]
error: Execution was interrupted, reason: EXC_BAD_ACCESS (code=1, address=0xb06b9940).
The process has been returned to the state before expression evaluation.
The global breakpoint state toggle is off.
You app is getting stopped because the code you are running threw an uncaught Mach exception. Mach exceptions are the equivalent of BSD Signals for the Mach kernel - which makes up the lowest levels of the macOS operating system.
In this case, the particular Mach exception is EXC_BREAKPOINT. EXC_BREAKPOINT is a common source of confusion... Because it has the word "breakpoint" in the name people think that it is a debugger breakpoint. That's not entirely wrong, but the exception is used more generally than that.
EXC_BREAKPOINT is in fact the exception that the lower layers of Mach reports when it executes a certain instruction (a trap instruction). That trap instruction is used by lldb to implement breakpoints, but it is also used as an alternative to assert in various bits of system software. For instance, swift uses this error if you access past the end of an array. It is a way to stop your program right at the point of the error. If you are running outside the debugger, this will lead to a crash. But if you are running in the debugger, then control will be returned to the debugger with this EXC_BREAKPOINT stop reason.
To avoid confusion, lldb will never show you EXC_BREAKPOINT as the stop reason if the trap was one that lldb inserted in the program you are debugging to implement a debugger breakpoint. It will always say breakpoint n.n instead.
So if you see a thread stopped with EXC_BREAKPOINT as its stop reason, that means you've hit some kind of fatal error, usually in some system library used by your program. A backtrace at this point will show you what component is raising that error.
Anyway, then having hit that error, you tried to figure out the class of the value in the eax register by calling the class method on it by running po [$eax class]. Calling that method (which will cause code to get run in the program you are debugging) lead to a crash. That's what the "error" message you cite was telling you.
That's almost surely because $eax doesn't point to a valid ObjC object, so you're just calling a method on some random value, and that's crashing.
Note, if you are debugging a 64 bit program, then $eax is actually the lower 32 bits of the real argument passing register - $rax. The bottom 32 bits of a 64 bit pointer is unlikely to be a valid pointer value, so it is not at all surprising that calling class on it led to a crash.
If you were trying to call class on the first passed argument (self in ObjC methods) on 64 bit Intel, you really wanted to do:
(lldb) po [$rax class]
Note, that was also unlikely to work, since $rax only holds self at the start of the function. Then it gets used as a scratch register. So if you are any ways into the function (which the fact that your code fatally failed some test makes seem likely) $rax would be unlikely to still hold self.
Note also, if this is a 32 bit program, then $eax is not in fact used for argument passing - 32 bit Intel code passes arguments on the stack, not in registers.
Anyway, the first thing to do to figure out what went wrong was to print the backtrace when you get this exception, and see what code was getting run at the time this error occurred.
Clean project and restart Xcode worked for me.
I'm adding my solution, as I've struggled with the same problem and I didn't find this solution anywhere.
In my case I had to run Product -> Clean Build Folder (Clean + Option key) and rebuild my project. Breakpoints and lldb commands started to work properly.

"EXC_BAD_ACCESS" vs "Segmentation fault". Are both same practically?

In my first few dummy apps(for practice while learning) I have come across a lot of EXC_BAD_ACCESS, that somehow taught me Bad-Access is : You are touching/Accessing a object that you shouldn't because either it is not allocated yet or deallocated or simply you are not authorized to access it.
Look at this sample code that has bad-access issue because I am trying to modify a const :
-(void)myStartMethod{
NSString *str = #"testing";
const char *charStr = [str UTF8String];
charStr[4] = '\0'; // bad access on this line.
NSLog(#"%s",charStr);
}
While Segmentation fault says : Segmentation fault is a specific kind of error caused by accessing memory that “does not belong to you.” It’s a helper mechanism that keeps you from corrupting the memory and introducing hard-to-debug memory bugs. Whenever you get a segfault you know you are doing something wrong with memory (more description here.
I wanna know two things.
One, Am I right about objective-C's EXC_BAD_ACCESS ? Do I get it right ?
Second, Are EXC_BAD_ACCESS and Segmentation fault same things and Apple has just improvised its name?
No, EXC_BAD_ACCESS is not the same as SIGSEGV.
EXC_BAD_ACCESS is a Mach exception (A combination of Mach and xnu compose the Mac OS X kernel), while SIGSEGV is a POSIX signal. When crashes occur with cause given as EXC_BAD_ACCESS, often the signal is reported in parentheses immediately after: For instance, EXC_BAD_ACCESS(SIGSEGV). However, there is one other POSIX signal that can be seen in conjunction with EXC_BAD_ACCESS: It is SIGBUS, reported as EXC_BAD_ACCESS(SIGBUS).
SIGSEGV is most often seen when reading from/writing to an address that is not at all mapped in the memory map, like the NULL pointer, or attempting to write to a read-only memory location (as in your example above). SIGBUS on the other hand can be seen even for addresses the process has legitimate access to. For instance, SIGBUS can smite a process that dares to load/store from/to an unaligned memory address with instructions that assume an aligned address, or a process that attempts to write to a page for which it has not the privilege level to do so.
Thus EXC_BAD_ACCESS can best be understood as the set of both SIGSEGV and SIGBUS, and refers to all ways of incorrectly accessing memory (whether because said memory does not exist, or does exist but is misaligned, privileged or whatnot), hence its name: exception – bad access.
To feast your eyes, here is the code, within the xnu-1504.15.3 (Mac OS X 10.6.8 build 10K459) kernel source code, file bsd/uxkern/ux_exception.c beginning at line 429, that translates EXC_BAD_ACCESS to either SIGSEGV or SIGBUS.
/*
* ux_exception translates a mach exception, code and subcode to
* a signal and u.u_code. Calls machine_exception (machine dependent)
* to attempt translation first.
*/
static
void ux_exception(
int exception,
mach_exception_code_t code,
mach_exception_subcode_t subcode,
int *ux_signal,
mach_exception_code_t *ux_code)
{
/*
* Try machine-dependent translation first.
*/
if (machine_exception(exception, code, subcode, ux_signal, ux_code))
return;
switch(exception) {
case EXC_BAD_ACCESS:
if (code == KERN_INVALID_ADDRESS)
*ux_signal = SIGSEGV;
else
*ux_signal = SIGBUS;
break;
case EXC_BAD_INSTRUCTION:
*ux_signal = SIGILL;
break;
...
Edit in relation to another of your questions
Please note that exception here does not refer to an exception at the level of the language, of the type one may catch with syntactical sugar like try{} catch{} blocks. Exception here refers to the actions of a CPU on encountering certain types of mistakes in your program (they may or may not be be fatal), like a null-pointer dereference, that require outside intervention.
When this happens, the CPU is said to raise what is commonly called either an exception or an interrupt. This means that the CPU saves what it was doing (the context) and deals with the exceptional situation.
To deal with such an exceptional situation, the CPU does not start executing any "exception-handling" code (catch-blocks or suchlike) in your application. It first gives the OS control, by starting to execute a kernel-provided piece of code called an interrupt service routine. This is a piece of code that figures out what happened to which process, and what to do about it. The OS thus has an opportunity to judge the situation, and take the action it wants.
The action it does for an invalid memory access (such as a null pointer dereference) is to signal the guilty process with EXC_BAD_ACCESS(SIGSEGV). The action it does for a misaligned memory access is to signal the guilty process with EXC_BAD_ACCESS(SIGBUS). There are many other exceptional situations and corresponding actions, not all of which involve signals.
We're now back in the context of your program. If your program receives the SIGSEGV or SIGBUS signals, it will invoke the signal handler that was installed for that signal, or the default one if none was. It is rare for people to install custom handlers for SIGSEGV and SIGBUS and the default handlers shut down your program, so what you usually get is your program being shut down.
This sort of exceptions is therefore completely unlike the sort one throws in try{}-blocks and catch{}es. Those exceptions are handled purely within the application, without involving the OS at all. Here what happens is that a throw statement is simply a glorified jump to the inner-most catch block that handles that exception. As the exception bubbles through the stack, it unwinds the stack behind it, running destructors and suchlike as needed.
Basically yes, indeed an EXC_BAD_ACCESS is usually paired with a SIGSEGV which is a signal that warns about the segmentation failure.
A segmentation failure is risen whenever you are working with a pointer that points to invalid data (maybe not belonging to the process, maybe read-only, maybe an invalid address in general).
Don't think about the segmentation fault in terms of "accessing an object", you are accessing a memory location, so an address. That address must be considered coherent by the OS memory protection system.
Not all errors which are related to accessing invalid data can be tracked by the memory manager, think about a pointer to a stack allocated variable, which is considered valid although its content is not valid anymore upon restoring the stack frame.

Runtime Error MESSAGE_TYPE_X when using SAP on Ipad mini (IOS)

In my company we´ve installed the app GuiXT Liquid UI on Ipad mini to access to our SAP-Systems.
The login works fine and we can open transactions (those which were delivered by SAP and also self-written), but as soon as we want to change variant, display a list or anything else, an Runtime Error occurs.
While opening the same transactions with the “normal” gui on windows-pcs everything works.
Following informations I get from the error message:
Runtime Errors MESSAGE_TYPE_X
Error analysis
Short text of error message:
Control Frame Work : Error in data stream <DATAMANAGER><TABLES><DATACHAN
GES HANDLE="2"><IT I; current tag PROPERTY,
Long text of error message:
Technical information about the message:
Message class....... "CNDP"
Number.............. 008
Variable 1.......... "<DATAMANAGER><TABLES><DATACHANGES HANDLE="2"><IT I"
Variable 2.......... "PROPERTY"
Variable 3.......... " "
Variable 4.......... " "
Information on where terminated
Termination occurred in the ABAP program "CL_GUI_DATAMANAGER============CP" -
in "TRACE_XML".
The main program was "RAZUGA_ALV01 ".
In the source code you have the termination point in line 2136
of the (Include) program "CL_GUI_DATAMANAGER============CL".
This is a really old question and I hope you found a solution.
I stumbled upon it while searching for help on a similar (but not exactly the same) error.
I found this note which doesn't apply to my system/error, but might apply to yours, so I'll leave it here for reference:
2318244 - Shortdump occurs in /IDXGC/PDOCMON01 when click some Process Step No. to display step additional data
Regards,
tao

Resources