How to enable and use huge pages and transparent huge pages in code on Ubuntu - huge-pages

I want to use huge pages or transparent huge pages in my code to optimize the performance of a data structure. But when I use madvise() in my code, it cannot allocate memory for me.
/sys/kernel/mm/transparent_hugepage/enabled contains: always [madvise] never.
/sys/kernel/mm/transparent_hugepage/defrag contains: always defer defer+madvise [madvise] never.
#include <iostream>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>
int main()
{
void* ptr;
std::cout << madvise(ptr, 1, MADV_HUGEPAGE) << std::endl;
std::cout << strerror(errno) << std::endl;
return 0;
}
The result of the above code is:
-1
Cannot allocate memory

Problems with the provided code example in the question
On my system, your code prints:
-1
Invalid argument
And I don't see how it would work in the first place. madvise does not allocate memory for you, it is used to set policies for existing memory ranges. Therefore, specifying an uninitialized pointer as the first argument is not going to work.
There exists documentation for the MADV_HUGEPAGE argument in the madvise manual:
Enable Transparent Huge Pages (THP) for pages in the range
specified by addr and length. Currently, Transparent Huge
Pages work only with private anonymous pages (see
mmap(2)). The kernel will regularly scan the areas marked
as huge page candidates to replace them with huge pages.
The kernel will also allocate huge pages directly when the
region is naturally aligned to the huge page size (see
posix_memalign(2)).
How to use permanently reserved huge pages
Here is the code rewritten to use mmap instead of madvise. With that, I can reproduce your error of "Cannot allocate memory":
#include <iostream>
#include <string.h> // strerror
#include <errno.h>
#include <sys/mman.h>
int main()
{
const auto memorySize = 16ULL * 1024ULL * 1024ULL;
void* data = mmap(
/* "If addr is NULL, then the kernel chooses the (page-aligned) address at which to create the mapping" */
nullptr,
memorySize,
/* memory protection / permissions */ PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
/* fd should for compatibility be -1 even though it is ignored for MAP_ANONYMOUS */ -1,
/* "The offset argument should be zero [when using MAP_ANONYMOUS]." */ 0
);
if ( data == MAP_FAILED ) {
std::cout << "Failed to allocate memory: " << strerror( errno ) << "\n";
} else {
std::cout << "Allocated pointer at: " << data << "\n";
}
munmap( data, memorySize );
return 0;
}
That error can be solved by actually making the kernel reserve some huge pages that can then be allocated. Normally, this should be done at boot time, when most memory is still unused, for a better chance of success. In my case, I was able to reserve 37 huge pages of 2 MiB each, i.e., 74 MiB of memory. I find that surprisingly low because I have 370 MiB "free" and 3.9 GiB "available" memory. Maybe I should close Firefox first and then try to reserve more huge pages, or maybe kswapd can somehow be triggered to defragment memory before reserving more huge pages.
echo 128 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
head /sys/kernel/mm/hugepages/hugepages-2048kB/*
Output:
==> /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages <==
37
==> /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages <==
37
==> /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages_mempolicy <==
37
==> /sys/kernel/mm/hugepages/hugepages-2048kB/nr_overcommit_hugepages <==
0
==> /sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages <==
0
==> /sys/kernel/mm/hugepages/hugepages-2048kB/surplus_hugepages <==
0
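As mentioned above, reserving the pages at boot time, before memory becomes fragmented, is more likely to succeed for larger counts. A sketch of that approach on Ubuntu with GRUB (the count of 128 is only an example):
# append "hugepages=128" to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
sudo update-grub
# after rebooting, verify the reservation:
cat /proc/sys/vm/nr_hugepages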
Now when I run the code snippet with clang++ hugePages.cpp && ./a.out, I get this output:
Allocated pointer at: 0x7f4454e00000
As can be seen from the trailing zeros, it is aligned to quite a large alignment value of 2 MiB.
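To verify that alignment programmatically instead of eyeballing the hex digits, a small check like the following can be appended after the successful mmap call in the example above (added here purely for illustration; it needs #include <cstdint>):
const auto address = reinterpret_cast<std::uintptr_t>( data );
std::cout << "Aligned to 2 MiB: " << std::boolalpha
          << ( address % ( 2ULL * 1024ULL * 1024ULL ) == 0 ) << "\n";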
How to use transparent huge pages
I have not seen any system actually using these fixed reserved huge pages. It seems that transparent huge pages have superseded that usage. Probably, partly because:
Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. Huge pages cannot be swapped out under memory pressure.
To mitigate these complexities, transparent huge pages were introduced:
No application changes need to be made to take advantage of THP, but interested application developers can try to optimize their use of it. A call to madvise() with the MADV_HUGEPAGE flag will mark a memory range as being especially suited to huge pages, while MADV_NOHUGEPAGE will suggest that huge pages are better used elsewhere. For applications that want to use huge pages, use of posix_memalign() can help to ensure that large allocations are aligned to huge page (2MB) boundaries.
That basically says it all, but I think the first statement is not true anymore because most systems nowadays are configured to madvise in /sys/kernel/mm/transparent_hugepage/enabled instead of always, for which the statement was probably intended. So, here is another try with madvise:
#include <array>
#include <chrono>
#include <fstream>
#include <iostream>
#include <string_view>
#include <thread>
#include <stdlib.h>
#include <string.h> // strerror
#include <errno.h>
#include <sys/mman.h>
int main()
{
const auto memorySize = 16ULL * 1024ULL * 1024ULL;
void* data{ nullptr };
const auto memalignError = posix_memalign(
&data, /* alignment equal or higher to huge page size */ 2ULL * 1024ULL * 1024ULL, memorySize );
if ( memalignError != 0 ) {
std::cout << "Failed to allocate memory: " << strerror( memalignError ) << "\n";
return 1;
}
std::cout << "Allocated pointer at: " << data << "\n";
if ( madvise( data, memorySize, MADV_HUGEPAGE ) != 0 ) {
std::cerr << "Error on madvise: " << strerror( errno ) << "\n";
return 2;
}
const auto intData = reinterpret_cast<int*>( data );
intData[0] = 3;
/* This access is at offset 3000 * sizeof( int ) = 12 kB, i.e.,
* still in the same 2 MiB page as the access above */
intData[3000] = 3;
intData[memorySize / sizeof( int ) / 2] = 3;
/* Check whether transparent huge pages have been allocated. */
std::ifstream smapsFile( "/proc/self/smaps" );
std::array<char, 4096> lineBuffer;
while ( smapsFile.good() ) {
/* Getline always appends null. */
smapsFile.getline( lineBuffer.data(), lineBuffer.size(), '\n' );
std::string_view line{ lineBuffer.data() };
if ( line.starts_with( "AnonHugePages:" ) && !line.contains( " 0 kB" ) ) {
std::cout << "We are successfully using transparent huge pages!\n " << line << "\n";
}
}
/* During this sleep /proc/meminfo and /proc/vmstat can be checked for transparent anonymous huge pages. */
using namespace std::chrono_literals;
std::this_thread::sleep_for( 100s );
/* Read the value before freeing to avoid a use-after-free. */
const auto exitCode = intData[3000] == 3 ? 0 : 3;
free( data );
return exitCode;
}
Running this with clang++ -std=c++2b hugeTransparentPages.cpp && ./a.out (C++23 is necessary for the string_view functionalities like contains), the output on my system is:
Allocated pointer at: 0x7f38cd600000
We are successfully using transparent huge pages!
AnonHugePages: 4096 kB
And this test was executed while cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages yields 0, i.e., there are no persistently reserved huge pages.
Note that only two pages (4096 kB) out of the requested 16 MiB were actually used because the other pages have not been written to. This is also why the call to madvise is possible and yields huge pages. It has to be done before the actual physical allocation, i.e., before writing to the allocated memory.
The example code includes a check for transparent huge pages for the process itself. This site lists multiple ways to check the amount of anonymous transparent huge pages that are in use. For example, you can check system-wide with:
grep AnonHugePages /proc/meminfo
What I find interesting is that this is normally 0 kB on my system, but while the example code with madvise is running it shows 4096 kB.
To me, this seems to mean that none of the programs I normally use employ persistent huge pages or transparent huge pages. I find that very surprising because there should be plenty of use cases for which the advantages of huge pages outweigh their disadvantages (wasted memory).

Related

Tracking stack size during application execution

I am running an application on an embedded (PowerPC 32 bit) system where there is a stack size limitation of 64K. I am experiencing some occasional crashes because of stack overflow.
I can build the application also for a normal Linux system (with some minor little changes in the code), so I can run an emulation on my development environment.
I was wondering what the best way is to find the methods that exceed the stack size limitation, and what the stack frame looks like when this happens (in order to perform some code refactoring).
I've already tried Callgrind (a Valgrind tool), but it seems not to be the right tool.
I'm looking more for a tool than changes in the code (since it's a 200K LOC and 100 files project).
The application is entirely written in C++03.
While it seems that there should be an existing tool for this, I would approach it by writing a small macro and adding it to the top of suspected functions:
char *__stack_root__;
#define GUARD_SIZE (64 * 1024 - 1024)
#define STACK_ROOT \
char __stack_frame__; \
__stack_root__ = &__stack_frame__;
#define STACK_GUARD \
char __stack_frame__; \
if (abs(&__stack_frame__ - __stack_root__) > GUARD_SIZE) { \
printf("stack is about to overflow in %s: at %d bytes\n", __FUNCTION__, abs(&__stack_frame__ - __stack_root__)); \
}
And here how to use it:
#include <stdio.h>
#include <stdlib.h>
void foo(int);
int main(int argc, char** argv) {
STACK_ROOT; // this macro records the root (bottom) of the thread's stack
foo(10000);
return 0;
}
void foo(int x) {
STACK_GUARD; // this macro checks if we're approaching the end of memory available for stack
if (x > 0) {
foo(x - 1);
}
}
A couple of notes here:
this code assumes a single thread. If you have multiple threads, you need to keep track of individual __stack_root__ variables, one per thread; use thread-local storage for this (a sketch follows below)
abs() is used so that the macro works both when PowerPC grows its stack up and when it grows it down (it can: it depends on your setup)
adjust the GUARD_SIZE to your liking, but keep it smaller than the maximum size of your stack on the target
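Following up on the first note, here is a minimal thread-local variant of the same macros. This is a sketch of mine, not part of the original macros; it assumes C++11's thread_local (on the C++03 target you could use GCC's __thread instead) and otherwise keeps the logic unchanged:
#include <cstdio>
#include <cstdlib>

static thread_local char* __stack_root__ = 0; // one stack root per thread

#define GUARD_SIZE (64 * 1024 - 1024)

#define STACK_ROOT \
    char __stack_frame__; \
    __stack_root__ = &__stack_frame__;

#define STACK_GUARD \
    char __stack_frame__; \
    if (labs(&__stack_frame__ - __stack_root__) > GUARD_SIZE) { \
        printf("stack is about to overflow in %s: at %ld bytes\n", \
               __FUNCTION__, labs(&__stack_frame__ - __stack_root__)); \
    }

// Every thread's entry point calls STACK_ROOT once; the rest of the code uses STACK_GUARD as before.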

Loading/Storing to XMFLOAT4 faster than using XMVECTOR?

I'm going through the DirectX Math/XNA Math library, and I got curious when I read about the alignment requirements for XMVECTOR (Now DirectX::XMVECTOR), and how it is expected of you to use XMFLOAT* for members instead, using XMLoad* and XMStore* when performing mathematical operations. I was specifically curious about the tradeoffs, so I did an experiment, as I'm sure many others have, and tested to see exactly how much you lose having to load and store the vectors for each operation. This is the resulting code:
#include <Windows.h>
#include <chrono>
#include <cstdint>
#include <DirectXMath.h>
#include <iostream>
using std::chrono::high_resolution_clock;
#define TEST_COUNT 1000000000l
int main(void)
{
DirectX::XMVECTOR v1 = DirectX::XMVectorSet(1, 2, 3, 4);
DirectX::XMVECTOR v2 = DirectX::XMVectorSet(2, 3, 4, 5);
DirectX::XMFLOAT4 x{ 1, 2, 3, 4 };
DirectX::XMFLOAT4 y{ 2, 3, 4, 5 };
std::chrono::system_clock::time_point start, end;
std::chrono::milliseconds duration;
// Test with just the XMVECTOR
start = high_resolution_clock::now();
for (uint64_t i = 0; i < TEST_COUNT; i++)
{
v1 = DirectX::XMVectorAdd(v1, v2);
}
end = high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
DirectX::XMFLOAT4 z;
DirectX::XMStoreFloat4(&z, v1);
std::cout << std::endl << "z = " << z.x << ", " << z.y << ", " << z.z << std::endl;
std::cout << duration.count() << " milliseconds" << std::endl;
// Now try with load/store
start = high_resolution_clock::now();
for (uint64_t i = 0; i < TEST_COUNT; i++)
{
v1 = DirectX::XMLoadFloat4(&x);
v2 = DirectX::XMLoadFloat4(&y);
v1 = DirectX::XMVectorAdd(v1, v2);
DirectX::XMStoreFloat4(&x, v1);
}
end = high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << std::endl << "x = " << x.x << ", " << x.y << ", " << x.z << std::endl;
std::cout << duration.count() << " milliseconds" << std::endl;
}
Running a debug build yields the output:
z = 3.35544e+007, 6.71089e+007, 6.71089e+007
25817 milliseconds
x = 3.35544e+007, 6.71089e+007, 6.71089e+007
84344 milliseconds
Okay, so about thrice as slow, but does anyone really take perf tests on debug builds seriously? Here are the results when I do a release build:
z = 3.35544e+007, 6.71089e+007, 6.71089e+007
1980 milliseconds
x = 3.35544e+007, 6.71089e+007, 6.71089e+007
670 milliseconds
Like magic, XMFLOAT4 runs almost three times faster! Somehow the tables have turned. Looking at the code, this makes no sense to me; the second part runs a superset of the commands that the first part runs! There must be something going wrong, or something I am not taking into account. It is difficult to believe that the compiler was able to optimize the second part nine-fold over the much simpler, and theoretically more efficient, first part.
The only reasonable explanations I have involve either (1) cache behavior, (2) some crazy out-of-order execution that XMVECTOR can't take advantage of, (3) the compiler is making some insane optimizations, or (4) using XMVECTOR directly has some implicit inefficiency that was able to be optimized away when using XMFLOAT4. That is, the default way the compiler loads and stores XMVECTORs from memory is less efficient than XMLoad* and XMStore*.
I attempted to inspect the disassembly, but I'm not all that familiar with X86 and/or SSE2, and Visual Studio does some crazy optimizations making it difficult to follow along with the source code. I also tried the Visual Studio performance analysis tool, but that didn't help as I can't figure out how to make it show the disassembly instead of the code. The only useful information I get out of that is that the first call to XMVectorAdd accounts for ~48.6% of all cycles while the second call to XMVectorAdd accounts for ~4.4% of all cycles.
EDIT:
After some more debugging, here is the assembly for the code that gets run inside of the loop. For the first part:
002912E0 movups xmm1,xmmword ptr [esp+18h] <-- HERE
002912E5 add ecx,1
002912E8 movaps xmm0,xmm2 <-- HERE
002912EB adc esi,0
002912EE addps xmm0,xmm1
002912F1 movups xmmword ptr [esp+18h],xmm0 <-- HERE
002912F6 jne main+60h (0291300h)
002912F8 cmp ecx,3B9ACA00h
002912FE jb main+40h (02912E0h)
And for the second part:
00291400 add ecx,1
00291403 addps xmm0,xmm1
00291406 adc esi,0
00291409 jne main+173h (0291413h)
0029140B cmp ecx,3B9ACA00h
00291411 jb main+160h (0291400h)
Note that these two loops are indeed nearly identical. The only difference is that the first for loop appears to be the one doing the loading and storing! It would appear as though Visual Studio made a ton of optimizations since x and y were on the stack. Changing them both to be on the heap (so the writes must actually happen) makes the machine code identical. Is this generally the case? Is there really no negative side effect to using the storage classes? Other than the fully optimized versions, I suppose.
If you define
DirectX::XMVECTOR v3 = DirectX::XMVectorSet(2, 3, 4, 5);
and use v3 instead of v1 as the result:
...
for (uint64_t i = 0; i < TEST_COUNT; i++)
{
v3 = DirectX::XMVectorAdd(v1, v2);
}
you get code that is faster than the second part's code, which uses XMLoadFloat4 and XMStoreFloat4.
Firstly, don't use Visual Studio's "high-resolution clock" for perf timing. You should use QueryPerformanceCounter instead. See Connect.
SIMD performance is difficult to measure in these micro tests because the overhead of loading up vector data can often dominate with such trivial ALU usage. You really need to do something substantial with the data to see the benefits. Also keep in mind that depending on your compiler settings, the compiler itself may be using the 'scalar' SIMD functionality or even auto-vectorizing such trivial code loops.
You are also seeing some issues with the way you are generating your test data. You should create something larger than a single vector on the heap and use that as your source/dest.
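To illustrate that last point, here is a rough sketch (my example, not part of this answer) that streams whole heap-allocated arrays through the same load/add/store pattern, so the arithmetic is amortized over many elements; any timing would be wrapped around a call to this function:
#include <DirectXMath.h>
#include <vector>

void AddArrays( const std::vector<DirectX::XMFLOAT4>& a,
                const std::vector<DirectX::XMFLOAT4>& b,
                std::vector<DirectX::XMFLOAT4>& out )
{
    for ( size_t i = 0; i < a.size(); ++i )
    {
        // Load both operands from memory into SIMD registers, add them, and store the result back.
        const DirectX::XMVECTOR va = DirectX::XMLoadFloat4( &a[i] );
        const DirectX::XMVECTOR vb = DirectX::XMLoadFloat4( &b[i] );
        DirectX::XMStoreFloat4( &out[i], DirectX::XMVectorAdd( va, vb ) );
    }
}
With a few million elements, memory traffic dominates, which is a more realistic setting for judging the XMFLOAT4 load/store pattern than adding a single vector a billion times.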
PS: The best way to create 'static' XMVECTOR data is to use the XMVECTORF32 type.
static const DirectX::XMVECTORF32 v1 = { 1, 2, 3, 4 };
Note that if you want to have the load/save conversions between XMVECTOR and XMFLOATx to be "automatic", take a look at SimpleMath in the DirectX Tool Kit. You just use types like SimpleMath::Vector4 in your data structures, and the implicit conversion operators take care of calling XMLoadFloat4 / XMStoreFloat4 for you.
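A minimal sketch of what that looks like, assuming the DirectX Tool Kit's SimpleMath.h is on the include path (the implicit conversions perform the XMLoadFloat4 / XMStoreFloat4 calls behind the scenes):
#include <SimpleMath.h>

DirectX::SimpleMath::Vector4 AddWithSimpleMath()
{
    using DirectX::SimpleMath::Vector4;
    const Vector4 a( 1.f, 2.f, 3.f, 4.f );
    const Vector4 b( 2.f, 3.f, 4.f, 5.f );
    // Implicit conversion to XMVECTOR for the call, and back to Vector4 for the assignment.
    const Vector4 sum = DirectX::XMVectorAdd( a, b );
    // Operator overloads are provided as well, so this is equivalent:
    const Vector4 sum2 = a + b;
    return sum + sum2;
}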

Process memory limit and address space for UNIX and Linux and Windows

What is the maximum amount of memory for a single process on UNIX, Linux, and Windows? How do you calculate that? How much user address space and kernel address space is there with 4 GB of RAM?
How much user address space and kernel address space for 4 GB of RAM?
The address space of a process is divided into two parts,
User space: On the standard 32 bit x86 architecture, the maximum addressable memory is 4 GB, out of which the addresses from 0x00000000 to 0xbfffffff (= 3 GB) are meant for code and data segments. This region can be addressed when the user process is executing in either user or kernel mode.
Kernel space: Similarly, the addresses from 0xc0000000 to 0xffffffff (= 1 GB) are meant for the kernel's virtual address space and can only be addressed when the process executes in kernel mode.
This particular address space split on x86 is determined by the value of PAGE_OFFSET. Referring to Linux 3.11.1v page_32_types.h and page_64_types.h, page offset is defined as below,
#define __PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
where Kconfig defines a default value of 0xC0000000, with other address split options also available.
Similarly for 64 bit,
#define __PAGE_OFFSET _AC(0xffff880000000000, UL).
On 64 bit architectures the 3G/1G split no longer holds because of the huge address space; as per the source, the latest Linux versions use the offset given above.
Looking at my 64 bit x86_64 machine, a 32 bit process can have the entire 4 GB of user address space, and the kernel holds the address range above 4 GB. Interestingly, on modern 64 bit x86_64 CPUs not all address lines are enabled (or the address bus is not large enough) to provide 2^64 = 16 exabytes of virtual address space. Presumably the AMD64/x86 architectures have the lower 48/42 bits enabled respectively, resulting in 2^48 = 256 TB / 2^42 = 4 TB of address space. This definitely improves performance with large amounts of RAM; at the same time the question arises how it is managed efficiently given the OS limitations.
In Linux there's a way to find out the limit of address space you can have.
Using the rlimit structure.
struct rlimit {
rlim_t rlim_cur; //current (soft) limit
rlim_t rlim_max; //ceiling for rlim_cur
};
rlim_t is an unsigned long type.
and you can have something like:
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
//Bytes To GigaBytes
static inline unsigned long btogb(unsigned long bytes) {
return bytes / (1024 * 1024 * 1024);
}
//Bytes To ExaBytes
static inline double btoeb(double bytes) {
return bytes / (1024.00 * 1024.00 * 1024.00 * 1024.00 * 1024.00 * 1024.00);
}
int main() {
printf("\n");
struct rlimit rlim_addr_space;
rlim_t addr_space;
/*
* Here we call to getrlimit(), with RLIMIT_AS (Address Space) and
* a pointer to our instance of rlimit struct.
*/
int retval = getrlimit(RLIMIT_AS, &rlim_addr_space);
// getrlimit returns 0 if it succeeded, let's check that.
if(!retval) {
addr_space = rlim_addr_space.rlim_cur;
fprintf(stdout, "Current address_space: %lu Bytes, or %lu GB, or %f EB\n", addr_space, btogb(addr_space), btoeb((double)addr_space));
} else {
fprintf(stderr, "Couldn't get the current address space limit.");
return 1;
}
return 0;
}
I ran this on my computer and... prrrrrrrrrrrrrrrrr tsk!
Output: Current address_space: 18446744073709551615 Bytes, or 17179869183 GB, or 16.000000 EB
I have 16 ExaBytes of max address space available on my Linux x86_64.
Here's getrlimit()'s definition
It also lists the other constants you can pass to getrlimit() and introduces getrlimit()'s sister setrlimit(). That is where the rlim_max member of rlimit becomes really important: you should always check that you don't exceed this value so the kernel doesn't punch your face, drink your coffee and steal your papers.
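A minimal sketch of that pattern, lowering the soft address-space limit while respecting the rlim_max ceiling (the 1 GiB value is purely illustrative):
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_AS, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    rlim_t wanted = 1UL << 30;      /* 1 GiB, illustrative only */
    if (wanted > rl.rlim_max)       /* never exceed the hard ceiling */
        wanted = rl.rlim_max;
    rl.rlim_cur = wanted;
    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    printf("Soft RLIMIT_AS is now %lu bytes\n", (unsigned long)rl.rlim_cur);
    return 0;
}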
PD: please excuse my sorry excuse of a drum roll ^_^
On Linux systems, see man ulimit
(UPDATED)
It says:
The ulimit builtin is used to set the resource usage limits of the
shell and any processes spawned by it. If a new limit value is
omitted, the current value of the limit of the resource is printed.
ulimit -a prints out all current values with their switch options; other switches, e.g. ulimit -n, print individual values such as the maximum number of open files.
Unfortunately, "max memory size" says "unlimited", which means that it has not been limited by the system administrator.
You can view the memory size by
cat /proc/meminfo
which gives something like:
MemTotal: 4048744 kB
MemFree: 465504 kB
Buffers: 316192 kB
Cached: 1306740 kB
SwapCached: 508 kB
Active: 1744884 kB
(...)
So, if ulimit says "unlimited", the MemFree is all yours. Almost.
Don't forget that malloc() (and the new operator, which calls malloc()) is a STDLIB function, so if you call malloc(100) 10 times there will be a lot of "slack"; follow the link to learn why.

tbb::parallel_for running out of memory on machine with 80 cores

I am trying to use tbb::parallel_for on a machine with 160 parallel threads (8 Intel E7-8870) and 0.5 TBytes of memory. It is a current Ubuntu system with kernel 3.2.0-35-generic #55-Ubuntu SMP. TBB is from the package libtbb2 Version 4.0+r233-1
Even with a very simple task, I tend to run out of resources, either "bad_alloc" or "thread_monitor Resource temporarily unavailable". I boiled it down to this very simple test:
#include <vector>
#include <cstdlib>
#include <cmath>
#include <iostream>
#include "tbb/tbb.h"
#include "tbb/task_scheduler_init.h"
using namespace tbb;
class Worker
{
std::vector<double>& dst;
public:
Worker(std::vector<double>& dst)
: dst(dst)
{}
void operator()(const blocked_range<size_t>& r ) const
{
for (size_t i=r.begin(); i!=r.end(); ++i)
dst[i] = std::sin(i);
}
};
int main(int argc, char** argv)
{
unsigned int n = 10000000;
unsigned int p = task_scheduler_init::default_num_threads();
std::cout << "Vector length: " << n << std::endl
<< "Processes : " << p << std::endl;
const size_t grain_size = n/p;
std::vector<double> src(n);
std::cerr << "Starting loop" << std::endl;
parallel_for(blocked_range<size_t>(0, n, grain_size), Worker(src));
std::cerr << "Loop finished" << std::endl;
}
Typical output is
Vector length: 10000000
Processes : 160
Starting loop
thread_monitor Resource temporarily unavailable
thread_monitor Resource temporarily unavailable
thread_monitor Resource temporarily unavailable
The errors appear randomly, and more frequent with greater n. The value of 10 million here is a point where they happen quite regularly. Nevertheless, given the machine characteristics, this should by far not exhaust the memory (I am using it alone for these tests).
The grain size was introduced after tbb created too many instances of the Worker, which made it fail for even smaller n.
Can anybody advise on how to set up tbb to handle large numbers of threads?
Summarizing the discussion in comments in an answer:
The message "thread_monitor Resource temporarily unavailable in pthread_create" basically tells that TBB cannot create enough threads; the "Resource temporarily unavailable" is what strerror() reports for the error code returned by pthread_create(). One possible reason for this error is insufficient memory to allocate stack for a new thread. By default, TBB requests 4M of stack for a worker thread; this value can be adjusted with a parameter to tbb::task_scheduler_init constructor if necessary.
In this particular case, as Guido Kanschat reported, the problem was caused by ulimit accidentally set which limited the memory available for the process.

Lua runs out of memory

I've written a complicated Lua script which uses the Lua sockets library. It reads a list of files from disk, sorts them by date and sends them to an HTTP process. The number of files on disk is around 65K. The memory usage in Task Manager doesn't exceed 200 MB.
After quite a while the script returns:
lua: not enough memory
I print out the current GC count at various points, and it never goes above 110 MB:
local freeMem = collectgarbage('count');
print("GC Count : " .. freeMem/1024 .. " MB");
This is on a 32 bit Windows machine.
What's the best way to diagnose this?
All memory goes through the single lua_Alloc function. This takes the form of:
typedef void* (*lua_Alloc) (void* ud, void* ptr, size_t osize, size_t nsize);
All allocations, reallocations and frees go through this. The documentation for this can be found at this web page. You can easily write your own to track all memory operations. For example,
void* MyAlloc (void* ud, void* ptr, size_t osize, size_t nsize)
{
(void)ud; // Not used
if (nsize == 0)
{
free(ptr);
TrackSubtract(osize);
return NULL;
}
else
{
void* p = realloc(ptr,nsize);
TrackSubtract(osize);
if (p) TrackAdd(nsize);
return p;
}
}
You can write the TrackAdd() and TrackSubtract() functions to whatever you want: output to a log; adjust a counter and so on.
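For example, a minimal counter-based version might look like this (a sketch of mine; it keeps a running total and prints it on every allocation so that sudden growth becomes visible):
#include <stdio.h>

static size_t g_luaAllocated = 0;   /* running total of bytes currently allocated by Lua */

static void TrackAdd(size_t bytes)
{
    g_luaAllocated += bytes;
    printf("lua memory in use: %lu bytes\n", (unsigned long)g_luaAllocated);
}

static void TrackSubtract(size_t bytes)
{
    g_luaAllocated -= bytes;
}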
To use your new function you pass a pointer to it when you create the Lua state:
lua_State* L = lua_newstate(&MyAlloc,0);
The documentation to lua_newstate is found here.
Good luck.
Use perfmon to monitor your process and add counters for private bytes and virtual bytes.
When your script ends with 'not enough memory' see the value of each counter. If you see sudden peaks in your memory usage, try to add more points in which you print the memory usage.
