tbb::parallel_for running out of memory on machine with 80 cores

I am trying to use tbb::parallel_for on a machine with 160 parallel threads (8 Intel E7-8870) and 0.5 TBytes of memory. It is a current Ubuntu system with kernel 3.2.0-35-generic #55-Ubuntu SMP. TBB is from the package libtbb2 Version 4.0+r233-1
Even with a very simple task, I tend to run out of resources, either "bad_alloc" or "thread_monitor Resource temporarily unavailable". I boiled it down to this very simple test:
#include <vector>
#include <cstdlib>
#include <cmath>
#include <iostream>

#include "tbb/tbb.h"
#include "tbb/task_scheduler_init.h"

using namespace tbb;

class Worker
{
    std::vector<double>& dst;
public:
    Worker(std::vector<double>& dst)
        : dst(dst)
    {}

    void operator()(const blocked_range<size_t>& r) const
    {
        for (size_t i = r.begin(); i != r.end(); ++i)
            dst[i] = std::sin(i);
    }
};

int main(int argc, char** argv)
{
    unsigned int n = 10000000;
    unsigned int p = task_scheduler_init::default_num_threads();
    std::cout << "Vector length: " << n << std::endl
              << "Processes : " << p << std::endl;
    const size_t grain_size = n / p;
    std::vector<double> src(n);

    std::cerr << "Starting loop" << std::endl;
    parallel_for(blocked_range<size_t>(0, n, grain_size), Worker(src));
    std::cerr << "Loop finished" << std::endl;
}
Typical output is
Vector length: 10000000
Processes : 160
Starting loop
thread_monitor Resource temporarily unavailable
thread_monitor Resource temporarily unavailable
thread_monitor Resource temporarily unavailable
The errors appear randomly, and more frequently with larger n. The value of 10 million here is a point where they happen quite regularly. Nevertheless, given the machine characteristics, this should be nowhere near exhausting the memory (I am using the machine alone for these tests).
The grain size was introduced after tbb created too many instances of the Worker, which made it fail for even smaller n.
Can anybody advise on how to set up tbb to handle large numbers of threads?

Summarizing the discussion in comments in an answer:
The message "thread_monitor Resource temporarily unavailable in pthread_create" basically tells that TBB cannot create enough threads; the "Resource temporarily unavailable" is what strerror() reports for the error code returned by pthread_create(). One possible reason for this error is insufficient memory to allocate stack for a new thread. By default, TBB requests 4M of stack for a worker thread; this value can be adjusted with a parameter to tbb::task_scheduler_init constructor if necessary.
In this particular case, as Guido Kanschat reported, the problem was caused by ulimit accidentally set which limited the memory available for the process.

Related

How to Trap Floating-Point Exceptions On Rosetta 2

In a related question, How to trap floating-point exceptions on M1 Macs?, someone wanted to understand how to make the following code work natively on macOS hosted by a machine using the M1 processor:
#include <cmath>       // for sqrt()
#include <csignal>     // for signal()
#include <cstdlib>     // for exit()
#include <iostream>
#include <xmmintrin.h> // for _mm_setcsr

void fpe_signal_handler(int /*signal*/) {
    std::cerr << "Floating point exception!\n";
    exit(1);
}

void enable_floating_point_exceptions() {
    _mm_setcsr(_MM_MASK_MASK & ~_MM_MASK_INVALID);
    signal(SIGFPE, fpe_signal_handler);
}

int main() {
    const double x{-1.0};
    std::cout << sqrt(x) << "\n";
    enable_floating_point_exceptions();
    std::cout << sqrt(x) << "\n";
}
I am looking at this from another angle, and want to understand why it doesn't work using Rosetta 2. I compiled it using the following command:
clang++ -g -std=c++17 -arch x86_64 -o fpe fpe.cpp
When I run it, I see the following output:
nan
nan
Mind you, when I do the same thing on an Intel-based Mac, I see the following output:
nan
Floating point exception!
Does anyone know if it is possible to trap floating-point exceptions on Rosetta 2?
Considering the difference in trapping on Intel using:
_mm_setcsr(_MM_MASK_MASK & ~_MM_MASK_INVALID);
and trapping on Apple Silicon using:
fegetenv(&env);
env.__fpcr = env.__fpcr | __fpcr_trap_invalid;
fesetenv(&env);
it seems more likely that this is a bug in the Rosetta 2 implementation.
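For reference, here is a possible native Apple Silicon version of the test program, assembled from the __fpcr snippet above and Apple's <fenv.h>; the __fpcr field and __fpcr_trap_invalid constant are Apple-specific extensions, and depending on the OS version the trap may arrive as SIGILL rather than SIGFPE on arm64, so the handler is installed for both signals:
#include <cmath>    // for sqrt()
#include <csignal>  // for signal()
#include <cstdlib>  // for exit()
#include <fenv.h>   // for fegetenv()/fesetenv() and __fpcr_trap_invalid
#include <iostream>

void fpe_signal_handler(int /*signal*/) {
    std::cerr << "Floating point exception!\n";
    exit(1);
}

void enable_floating_point_exceptions() {
    fenv_t env;
    fegetenv(&env);
    env.__fpcr = env.__fpcr | __fpcr_trap_invalid; // unmask the invalid-operation trap
    fesetenv(&env);
    signal(SIGFPE, fpe_signal_handler);
    signal(SIGILL, fpe_signal_handler); // arm64 may report the trap as SIGILL
}

int main() {
    const double x{-1.0};
    std::cout << sqrt(x) << "\n";
    enable_floating_point_exceptions();
    std::cout << sqrt(x) << "\n";
}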

How to enable and use huge pages and transparent huge pages in code on Ubuntu

I want to use huge pages or transparent huge pages in my code to optimize the performance of a data structure. But when I call madvise() in my code, it fails with "Cannot allocate memory".
The file /sys/kernel/mm/transparent_hugepage/enabled contains always [madvise] never.
The file /sys/kernel/mm/transparent_hugepage/defrag contains always defer defer+madvise [madvise] never.
#include <iostream>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>

int main()
{
    void* ptr;
    std::cout << madvise(ptr, 1, MADV_HUGEPAGE) << std::endl;
    std::cout << strerror(errno) << std::endl;
    return 0;
}
The result of the above code is:
-1
Cannot allocate memory
Problems with the provided code example in the question
On my system, your code prints:
-1
Invalid argument
And I don't see how it would work in the first place. madvise does not allocate memory for you; it is used to set policies for existing memory ranges. Therefore, specifying an uninitialized pointer as the first argument is not going to work.
The MADV_HUGEPAGE flag is documented in the madvise manual:
Enable Transparent Huge Pages (THP) for pages in the range
specified by addr and length. Currently, Transparent Huge
Pages work only with private anonymous pages (see
mmap(2)). The kernel will regularly scan the areas marked
as huge page candidates to replace them with huge pages.
The kernel will also allocate huge pages directly when the
region is naturally aligned to the huge page size (see
posix_memalign(2)).
How to use permanently reserved huge pages
Here is the code rewritten to use mmap instead of madvise. With it, I can reproduce your "Cannot allocate memory" error:
#include <cerrno>
#include <cstring>   // for strerror()
#include <iostream>
#include <sys/mman.h>

int main()
{
    const auto memorySize = 16ULL * 1024ULL * 1024ULL;
    void* data = mmap(
        /* "If addr is NULL, then the kernel chooses the (page-aligned) address at which to create the mapping" */
        nullptr,
        memorySize,
        /* memory protection / permissions */ PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
        /* fd should for compatibility be -1 even though it is ignored for MAP_ANONYMOUS */ -1,
        /* "The offset argument should be zero [when using MAP_ANONYMOUS]." */ 0
    );

    if ( data == MAP_FAILED ) {
        std::cout << "Failed to allocate memory: " << strerror( errno ) << "\n";
    } else {
        std::cout << "Allocated pointer at: " << data << "\n";
    }

    munmap( data, memorySize );
    return 0;
}
That error can be solved by making the kernel reserve some huge pages that can then be allocated. Normally, this should be done at boot time, when most memory is still unused, for a better chance of success. In my case, I was only able to reserve 37 huge pages of 2 MiB each, i.e., 74 MiB of memory. I find that surprisingly low because I have 370 MiB "free" and 3.9 GiB "available" memory. Maybe I should close Firefox first and then try to reserve more huge pages, or maybe kswapd can somehow be triggered to defragment memory before reserving more huge pages.
echo 128 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
head /sys/kernel/mm/hugepages/hugepages-2048kB/*
Output:
==> /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages <==
37
==> /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages <==
37
==> /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages_mempolicy <==
37
==> /sys/kernel/mm/hugepages/hugepages-2048kB/nr_overcommit_hugepages <==
0
==> /sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages <==
0
==> /sys/kernel/mm/hugepages/hugepages-2048kB/surplus_hugepages <==
0
Now when I run the code snippet with clang++ hugePages.cpp && ./a.out, I get this output:
Allocated pointer at: 0x7f4454e00000
As can be seen from the trailing zeros, it is aligned to quite a large alignment value of 2 MiB.
How to use transparent huge pages
I have not seen any system actually using these fixed reserved huge pages. It seems that transparent huge pages have superseded that usage. Probably, partly because:
Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. Huge pages cannot be swapped out under memory pressure.
To mitigate these complexities, transparent huge pages were introduced:
No application changes need to be made to take advantage of THP, but interested application developers can try to optimize their use of it. A call to madvise() with the MADV_HUGEPAGE flag will mark a memory range as being especially suited to huge pages, while MADV_NOHUGEPAGE will suggest that huge pages are better used elsewhere. For applications that want to use huge pages, use of posix_memalign() can help to ensure that large allocations are aligned to huge page (2MB) boundaries.
That basically says it all, but I think the first statement is not true anymore, because most systems nowadays are configured to madvise in /sys/kernel/mm/transparent_hugepage/enabled instead of always, which is what the statement probably was intended for. So, here is another try with madvise:
#include <array>
#include <chrono>
#include <fstream>
#include <iostream>
#include <string_view>
#include <thread>

#include <errno.h>
#include <stdlib.h>
#include <string.h>  // strerror
#include <sys/mman.h>

int main()
{
    const auto memorySize = 16ULL * 1024ULL * 1024ULL;
    void* data{ nullptr };
    const auto memalignError = posix_memalign(
        &data, /* alignment equal or higher to huge page size */ 2ULL * 1024ULL * 1024ULL, memorySize );
    if ( memalignError != 0 ) {
        std::cout << "Failed to allocate memory: " << strerror( memalignError ) << "\n";
        return 1;
    }
    std::cout << "Allocated pointer at: " << data << "\n";

    if ( madvise( data, memorySize, MADV_HUGEPAGE ) != 0 ) {
        std::cerr << "Error on madvise: " << strerror( errno ) << "\n";
        return 2;
    }

    const auto intData = reinterpret_cast<int*>( data );
    intData[0] = 3;
    /* This access is at offset 3000 * sizeof( int ) = 12 kB, i.e.,
     * still in the same 2 MiB page as the access above */
    intData[3000] = 3;
    intData[memorySize / sizeof( int ) / 2] = 3;

    /* Check whether transparent huge pages have been allocated. */
    std::ifstream smapsFile( "/proc/self/smaps" );
    std::array<char, 4096> lineBuffer;
    while ( smapsFile.good() ) {
        /* Getline always appends null. */
        smapsFile.getline( lineBuffer.data(), lineBuffer.size(), '\n' );
        std::string_view line{ lineBuffer.data() };
        if ( line.starts_with( "AnonHugePages:" ) && !line.contains( " 0 kB" ) ) {
            std::cout << "We are successfully using transparent huge pages!\n " << line << "\n";
        }
    }

    /* During this sleep /proc/meminfo and /proc/vmstat can be checked for transparent anonymous huge pages. */
    using namespace std::chrono_literals;
    std::this_thread::sleep_for( 100s );

    free( data );
    return intData[3000] == 3 ? 0 : 3;
}
Running this with clang++ -std=c++2b hugeTransparentPages.cpp && ./a.out (C++23 is necessary for the string_view functionalities like contains), the output on my system is:
Allocated pointer at: 0x7f38cd600000
We are successfully using transparent huge pages!
AnonHugePages: 4096 kB
And this test was executed while cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages yields 0, i.e., there are no persistently reserved huge pages.
Note that only two pages (4096 kB) out of the requested 16 MiB were actually used because the other pages have not been written to. This is also why the call to madvise is possible and yields huge pages. It has to be done before the actual physical allocation, i.e., before writing to the allocated memory.
The example code includes a check for transparent huge pages for the process itself. This site lists multiple ways to check the amount of anonymous transparent huge pages that are in use. For example, you can check system-wide with:
grep AnonHugePages /proc/meminfo
What I find interesting is that normally, this is 0 kB on my system and while the example code with madvise is running it yields 4096 kB.
To me, it seems like this means that none of my normally used programs use any persistent huge pages and also no transparent huge pages. I find that very surprising because there should be a lot of use cases for which huge page advantages should outstrip their disadvantages (wasted memory).

Tracking stack size during application execution

I am running an application on an embedded (PowerPC 32 bit) system where there is a stack size limitation of 64K. I am experiencing some occasional crashes because of stack overflow.
I can build the application also for a normal Linux system (with some minor little changes in the code), so I can run an emulation on my development environment.
I was wondering what the best way is to find the methods that exceed the stack size limitation, and what the stack frame is when this happens (in order to do some code refactoring).
I've already tried Callgrind (a Valgrind tool), but it seems not to be the right tool.
I'm looking more for a tool than changes in the code (since it's a 200K LOC and 100 files project).
The application is entirely written in C++03.
While it seems that there should be an existing tool for this, I would approach it by writing a small macro and adding it to the top of suspected functions:
char *__stack_root__;

#define GUARD_SIZE (64 * 1024 - 1024)

#define STACK_ROOT \
    char __stack_frame__; \
    __stack_root__ = &__stack_frame__;

#define STACK_GUARD \
    char __stack_frame__; \
    if (abs(&__stack_frame__ - __stack_root__) > GUARD_SIZE) { \
        printf("stack is about to overflow in %s: at %d bytes\n", __FUNCTION__, abs(&__stack_frame__ - __stack_root__)); \
    }
And here is how to use it:
#include <stdio.h>
#include <stdlib.h>

void foo(int);

int main(int argc, char** argv) {
    STACK_ROOT; // this macro records the bottom of the thread's stack
    foo(10000);
    return 0;
}

void foo(int x) {
    STACK_GUARD; // this macro checks if we're approaching the end of the memory available for the stack
    if (x > 0) {
        foo(x - 1);
    }
}
A couple of notes here:
this code assumes a single thread. If you have multiple threads, you need one __stack_root__ variable per thread; use thread-local storage for this (see the sketch after these notes)
abs() is used so that the macro works whether PowerPC grows its stack up or down (it can do either, depending on your setup)
adjust GUARD_SIZE to your liking, but keep it smaller than the maximum stack size on the target
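If you do need the multi-threaded variant, a hypothetical sketch could look like this; it assumes the compiler supports GCC's __thread extension (the project is C++03, so the standard thread_local keyword is not available):
// Hypothetical multi-threaded variant: only the root pointer changes, the
// STACK_ROOT / STACK_GUARD macros from above stay the same.
static __thread char *__stack_root__ = 0;   // one stack root per thread

// Each thread's entry function executes STACK_ROOT once:
void* worker_thread(void* arg)
{
    STACK_ROOT;   // records this thread's stack root
    // ... call the suspected functions, each of which uses STACK_GUARD ...
    return 0;
}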

Loading/Storing to XMFLOAT4 faster than using XMVECTOR?

I'm going through the DirectX Math/XNA Math library, and I got curious when I read about the alignment requirements for XMVECTOR (Now DirectX::XMVECTOR), and how it is expected of you to use XMFLOAT* for members instead, using XMLoad* and XMStore* when performing mathematical operations. I was specifically curious about the tradeoffs, so I did an experiment, as I'm sure many others have, and tested to see exactly how much you lose having to load and store the vectors for each operation. This is the resulting code:
#include <Windows.h>
#include <chrono>
#include <cstdint>
#include <DirectXMath.h>
#include <iostream>

using std::chrono::high_resolution_clock;

#define TEST_COUNT 1000000000l

int main(void)
{
    DirectX::XMVECTOR v1 = DirectX::XMVectorSet(1, 2, 3, 4);
    DirectX::XMVECTOR v2 = DirectX::XMVectorSet(2, 3, 4, 5);
    DirectX::XMFLOAT4 x{ 1, 2, 3, 4 };
    DirectX::XMFLOAT4 y{ 2, 3, 4, 5 };

    high_resolution_clock::time_point start, end;
    std::chrono::milliseconds duration;

    // Test with just the XMVECTOR
    start = high_resolution_clock::now();
    for (uint64_t i = 0; i < TEST_COUNT; i++)
    {
        v1 = DirectX::XMVectorAdd(v1, v2);
    }
    end = high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    DirectX::XMFLOAT4 z;
    DirectX::XMStoreFloat4(&z, v1);
    std::cout << std::endl << "z = " << z.x << ", " << z.y << ", " << z.z << std::endl;
    std::cout << duration.count() << " milliseconds" << std::endl;

    // Now try with load/store
    start = high_resolution_clock::now();
    for (uint64_t i = 0; i < TEST_COUNT; i++)
    {
        v1 = DirectX::XMLoadFloat4(&x);
        v2 = DirectX::XMLoadFloat4(&y);
        v1 = DirectX::XMVectorAdd(v1, v2);
        DirectX::XMStoreFloat4(&x, v1);
    }
    end = high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    std::cout << std::endl << "x = " << x.x << ", " << x.y << ", " << x.z << std::endl;
    std::cout << duration.count() << " milliseconds" << std::endl;
}
Running a debug build yields the output:
z = 3.35544e+007, 6.71089e+007, 6.71089e+007
25817 milliseconds
x = 3.35544e+007, 6.71089e+007, 6.71089e+007
84344 milliseconds
Okay, so about thrice as slow, but does anyone really take perf tests on debug builds seriously? Here are the results when I do a release build:
z = 3.35544e+007, 6.71089e+007, 6.71089e+007
1980 milliseconds
x = 3.35544e+007, 6.71089e+007, 6.71089e+007
670 milliseconds
Like magic, XMFLOAT4 runs almost three times faster! Somehow the tables have turned. Looking at the code, this makes no sense to me; the second part runs a superset of the commands that the first part runs! There must be something going wrong, or something I am not taking into account. It is difficult to believe that the compiler was able to optimize the second part nine-fold over the much simpler, and theoretically more efficient first part. The only reasonable explanations I have involve either (1) cache behavior, (2) some crazy out of order execution that XMVECTOR can't take advantage of, (3) the compiler is making some insane optimizations, or (4) using XMVECTOR directly has some implicit inefficiency that was able to be optimized away when using XMFLOAT4. That is, the default way the compiler loads and stores XMVECTORs from memory is less efficient than XMLoad* and XMStore*. I attempted to inspect the disassembly, but I'm not all that familiar with X86 and/or SSE2 and Visual Studio does some crazy optimizations making it difficult to follow along with the source code. I also tried the Visual Studio performance analysis tool, but that didn't help as I can't figure out how to make it show the disassembly instead of the code. The only useful information I get out of that is that the first call to XMVectorAdd accounts for ~48.6% of all cycles while the second call to XMVectorAdd accounts for ~4.4% of all cycles.
EDIT:
After some more debugging, here is the assembly for the code that gets run inside of the loop. For the first part:
002912E0 movups xmm1,xmmword ptr [esp+18h] <-- HERE
002912E5 add ecx,1
002912E8 movaps xmm0,xmm2 <-- HERE
002912EB adc esi,0
002912EE addps xmm0,xmm1
002912F1 movups xmmword ptr [esp+18h],xmm0 <-- HERE
002912F6 jne main+60h (0291300h)
002912F8 cmp ecx,3B9ACA00h
002912FE jb main+40h (02912E0h)
And for the second part:
00291400 add ecx,1
00291403 addps xmm0,xmm1
00291406 adc esi,0
00291409 jne main+173h (0291413h)
0029140B cmp ecx,3B9ACA00h
00291411 jb main+160h (0291400h)
Note that these two loops are indeed nearly identical. The only difference is that the first loop appears to be the one doing the loading and storing! It would appear as though Visual Studio made a ton of optimizations because x and y were on the stack. After changing them both to be on the heap (so the writes must actually happen), the machine code is identical. Is this generally the case? Is there really no negative side effect to using the storage classes? Other than the fully optimized versions, I suppose.
If you define
DirectX::XMVECTOR v3 = DirectX::XMVectorSet(2, 3, 4, 5);
and use v3 instead of v1 as the result:
...
for (uint64_t i = 0; i < TEST_COUNT; i++)
{
    v3 = DirectX::XMVectorAdd(v1, v2);
}
you get code that is faster than the second part, which uses XMLoadFloat4 and XMStoreFloat4.
Firstly, don't use Visual Studio's "high-resolution clock" for perf timing. You should use QueryPerformanceCounter instead. See Connect.
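For instance, a bare-bones version of that timing could look like this (the timed region is left as a placeholder):
#include <Windows.h>
#include <iostream>

int main()
{
    LARGE_INTEGER frequency, start, end;
    QueryPerformanceFrequency(&frequency);   // counts per second

    QueryPerformanceCounter(&start);
    // ... code under test ...
    QueryPerformanceCounter(&end);

    const double milliseconds =
        1000.0 * static_cast<double>(end.QuadPart - start.QuadPart) / frequency.QuadPart;
    std::cout << milliseconds << " milliseconds" << std::endl;
}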
SIMD performance is difficult to measure in these micro tests because the overhead of loading up vector data can often dominate with such trivial ALU usage. You really need to do something substantial with the data to see the benefits. Also keep in mind that depending on your compiler settings, the compiler itself may be using the 'scalar' SIMD functionality or even auto-vectorizing such trivial code loops.
You are also seeing some issues with the way you are generating your test data. You should create something larger than a single vector on the heap and use that as your source/dest.
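For example, a hedged sketch of that idea (the array length, the XMVectorReplicate constant, and the function name are made up for the illustration, not part of the original answer):
#include <DirectXMath.h>
#include <vector>

void addConstant(std::vector<DirectX::XMFLOAT4>& values)
{
    const DirectX::XMVECTOR offset = DirectX::XMVectorReplicate(2.0f);
    for (size_t i = 0; i < values.size(); ++i)
    {
        // Load from the heap array, do the work, store back to the same array.
        DirectX::XMVECTOR v = DirectX::XMLoadFloat4(&values[i]);
        v = DirectX::XMVectorAdd(v, offset);
        DirectX::XMStoreFloat4(&values[i], v);
    }
}

// Usage: std::vector<DirectX::XMFLOAT4> data(1 << 20, DirectX::XMFLOAT4(1, 2, 3, 4));
//        addConstant(data);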
PS: The best way to create 'static' XMVECTOR data is to use the XMVECTORF32 type.
static const DirectX::XMVECTORF32 v1 = { 1, 2, 3, 4 };
Note that if you want to have the load/save conversions between XMVECTOR and XMFLOATx to be "automatic", take a look at SimpleMath in the DirectX Tool Kit. You just use types like SimpleMath::Vector4 in your data structures, and the implicit conversion operators take care of calling XMLoadFloat4 / XMStoreFloat4 for you.
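As a small illustration of that, assuming the DirectX Tool Kit's SimpleMath.h is on the include path (the Particle struct and integrate function are invented for the example):
#include <SimpleMath.h>

using DirectX::SimpleMath::Vector4;

struct Particle
{
    Vector4 position;  // stored as an XMFLOAT4 under the hood
    Vector4 velocity;
};

void integrate(Particle& p, float dt)
{
    // The overloaded operators call XMLoadFloat4 / XMStoreFloat4 internally.
    p.position += p.velocity * dt;
}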

malloc() not showing up in System Monitor

I wrote a program which only purpose was to allocate a given amount of memory so that I could see its effect on the Ubuntu 12.04 System Monitor.
This is what I wrote:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        exit(-1);

    int *mem = 0;
    int kB = 1024 * atoi(argv[1]);

    mem = malloc(kB);
    if (mem == NULL)
        exit(-2);

    sleep(3);
    free(mem);
    exit(0);
}
I used sleep() so that the program would not end immediately, giving System Monitor time to show the change in memory.
What puzzles me is that even if I allocate 1 GB, the System Monitor shows no memory change at all! Why is that?
I thought the reason could be that the allocated memory was never accessed, so I tried to insert the following before sleep():
int i;
for (i = 0; i < kB/sizeof(int); i++)
    mem[i] = i;
But this too had no effect on the graph in System Monitor.
Thanks.
