Torch: RNN clones run out of GPU memory - lua

Karpathy's char-rnn (based on Wojciechz learning_to_execute) uses a common RNN hack: clone a prototype network as many times as there are time steps per sequence, and share the parameters between the clones.
I can watch my 5GB GPU memory run out when I clone 217 times (the threshold is likely lower), resulting in this:
opt/torch/install/share/lua/5.1/torch/File.lua:270: cuda runtime error (2) : out of memory at /mounts/Users/student/davidk/opt/torch/extra/cutorch/lib/THC/THCStorage.cu:44
The problem is in the clone_many_times() function. The clones seem to point to the same physical parameter storage as the prototype, but for some reason memory still explodes.
Has anyone encountered this and/or have an idea how to train really long sequences?
(Same question asked here: https://github.com/karpathy/char-rnn/issues/108)

To run the model, I had to increase the memory capacity on the GPUs. With Sun's Grid Engine, use -l h_vmem=8G for 8 GB.
Otherwise, you can try torch-rnn. It uses Adam for optimization and hard-codes the RNN/LSTM forward/backward passes for space/time efficiency. This also avoids headaches with cloning models.
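One way to reason about why sharing parameters does not by itself keep memory flat: each clone still needs its own output and gradient buffers for its time step, so activation memory grows linearly with the number of clones even though the weights are stored once. The sketch below is a back-of-the-envelope estimate under that assumption. The function name and the example sizes are hypothetical and are not taken from char-rnn.

```python
def unrolled_memory_bytes(num_params, activation_values_per_step,
                          time_steps, bytes_per_value=4):
    """Rough memory estimate for an RNN unrolled by cloning.

    Assumes weights and their gradients are shared by all clones (stored
    once), while every clone keeps its own activation buffers.
    """
    shared = 2 * num_params * bytes_per_value            # weights + gradients
    per_clone = activation_values_per_step * bytes_per_value
    return shared + time_steps * per_clone


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    for steps in (50, 217, 1000):
        total = unrolled_memory_bytes(
            num_params=3_000_000,
            activation_values_per_step=5_000_000,
            time_steps=steps,
        )
        print(steps, "steps ->", round(total / 2**30, 2), "GiB")
```

If the per-clone term dominates, shortening the sequence or reducing the batch size lowers memory roughly in proportion, whereas sharing more parameters does not.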

Related

Why memory RSS is low but throughput is high

I'm working on a NUMA-related benchmark and got an issue troubling me for a week.
I use numactl to pin a matrix multiplication workload to node0's CPUs and node1's memory like this:
workload
However, I observe that a lot of memory accesses still go to node0 (its throughput is high), whether I measure with pcm or with numatop:
[Two screenshots: pcm and numatop output]
Then I tracked the workload's page mapping by reading /proc/<pid>/numa_maps and dumped the number of pages on both NUMA nodes throughout the run:
[Screenshot: pages per NUMA node over the run]
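The per-node page counts described above can be collected with a few lines of Python. This is a minimal sketch, assuming the usual numa_maps layout in which each mapping line carries `N<node>=<pages>` fields. The helper names and the sample text are made up for illustration.

```python
import re
from collections import Counter

_NODE_FIELD = re.compile(r"N(\d+)=(\d+)")


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields over all mappings in numa_maps text."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for token in line.split():
            match = _NODE_FIELD.fullmatch(token)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return dict(totals)


def read_numa_maps(pid):
    """Read /proc/<pid>/numa_maps (Linux only; needs permission for the pid)."""
    with open(f"/proc/{pid}/numa_maps") as f:
        return f.read()


if __name__ == "__main__":
    # Made-up sample in the numa_maps format, not real measurement data.
    sample = (
        "00400000 default file=/usr/bin/python3 mapped=12 N0=10 N1=2 kernelpagesize_kB=4\n"
        "7f0000000000 bind:1 anon=300 dirty=300 N1=300 kernelpagesize_kB=4\n"
    )
    print(pages_per_node(sample))
```

Calling `pages_per_node(read_numa_maps(pid))` periodically gives the page-count time series per node.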
The funny thing is that node0's number of active pages remains low throughout.
This doesn't make sense to me: why is RSS low while memory throughput is high?
FYI, the above workload is in Python and uses NumPy. When I run the same thing in Go using a Go library, both RSS and throughput on node0 are low. I couldn't figure out what happens in the Python runtime that causes so many memory accesses to go to node0.
I assumed that some dynamic libraries reside on node0 at runtime, so I tracked those libraries (see figure) and evicted them from memory before running. However, the result stays the same.

Can hardware threads access main memory at the same time?

I am trying to understand microarchitecture.
When an operating system schedules code to run on a CPU hardware thread (as in Intel HyperThreading), can each execution context issue memory reads in parallel or is the pipeline shared?
I am trying to do some rough calculations and complexity analysis, and I want to know whether memory bandwidth is shared: should I divide my calculation by the number of cores (assuming the pipeline is shared) or by the number of hardware threads (assuming memory bandwidth is parallel)?
Yes, the pipeline is shared, so it's possible for each of the two load execution units in a physical core to be running a uop from a different logical core, accessing L1d in parallel. (e.g. https://www.realworldtech.com/haswell-cpu/5/ / https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram)
Off-core (L2 miss) bandwidth doesn't scale with the number of logical cores, and one thread per core can fairly easily saturate it, especially with SIMD, if your code has high throughput (not bottlenecking on latency or branch misses) and low computational intensity (ALU work per load of data into registers, or into L1d or L2 cache, whichever you're cache-blocking for), e.g. a dot product.
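For the rough calculation the question asks about, a simple roofline-style bound captures the point: attainable throughput is the smaller of the compute peak and bandwidth times computational intensity. The sketch below is an assumption-laden illustration. The function name and all hardware numbers are hypothetical, and the roofline framing is not something the answer itself names.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Upper bound on throughput: limited by compute or by memory bandwidth."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


if __name__ == "__main__":
    # A float32 dot product does 2 FLOPs (multiply + add) per 8 bytes loaded.
    dot_intensity = 2 / 8
    # Hypothetical machine: 100 GFLOP/s peak per core, 20 GB/s off-core bandwidth.
    print(attainable_gflops(100.0, 20.0, dot_intensity))
```

With such a low intensity the bound is set by bandwidth, so adding a second hardware thread on the same core does not raise it.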
Well-tuned high-throughput (instructions per cycle) code like linear algebra stuff (especially matmul) often doesn't benefit from more than 1 thread per physical core, instead suffering more cache misses when two threads are competing for the same L1d / L2 cache.
Cache-blocking aka loop tiling can help a lot, if you can loop again over a smaller chunk of data while it's still hot in cache. See How much of ‘What Every Programmer Should Know About Memory’ is still valid? (most of it).
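To make the loop structure of cache blocking concrete, here is a minimal pure-Python sketch of a tiled matrix multiply. It only illustrates the loop order; interpreted Python will not show the cache benefit, and the function name and the tile size are arbitrary choices for the example.

```python
def matmul_tiled(a, b, tile=32):
    """Multiply matrices given as lists of rows, one tile at a time.

    The three outer loops pick a tile; the three inner loops reuse that
    small block of a, b and c while it would still be hot in cache.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, k, tile):
            for j0 in range(0, m, tile):
                for i in range(i0, min(i0 + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, min(k0 + tile, k)):
                        a_ik, row_b = row_a[kk], b[kk]
                        for j in range(j0, min(j0 + tile, m)):
                            row_c[j] += a_ik * row_b[j]
    return c


if __name__ == "__main__":
    print(matmul_tiled([[1, 2], [3, 4]], [[5, 6], [7, 8]], tile=1))
```

In a compiled language the tile size would be chosen so that the working set of the inner loops fits in L1d or L2.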

Castalia Memory Issue

My application layer protocol works fine, but when the number of nodes is large (more than 600) it exits without any error.
I traced the code and didn't find any problem. It seems to be a memory problem, since the number of nodes is large and they perform many operations.
Update:
In my application:
Each node broadcasts 2msg/second, during all the simulation time.
The msgs contain much information related to my application.
All the nodes are static.
Using BypassRouting, BypassMAC, Radio cc2420.
Castalia works with more than 600 nodes, and reached 2500 in my previous experiments, but only with a short simulation time ... so it depends on the relation between the number of nodes, the simulation time, and the number of messages sent per second.
A single experiment runs successfully ... but when running, for example, with 30 seeds (i.e. -r 30) and 110 nodes:
it stopped after experiment 13 with simulation time = 1000s
and it stopped after experiment 22 with simulation time = 600s
How can I free memory from unnecessary things during simulation runs?
(Note: previously I increased the swap space, and that worked up to a certain limit.)
Thanks,
Without more information on your application and the simulation scenario, it's hard to provide very specific suggestions. At the very least, you could provide your ini file and information about any custom modules you are using (your application module, for example). Are you using any mobile nodes? Which protocols are you using? What does your app module do? In general, Castalia should be able to handle 600 nodes. In the past, we have tested Castalia with thousands of (static) nodes.
You could use a memory profiler. An excellent tool (a suite of tools really) is valgrind. You can find memory leaks, and you can also memory profile your program. The heap profiler tool of valgrind is called 'massif':
Massif is a heap profiler. It performs detailed heap profiling by taking regular snapshots of a program's heap. It produces a graph showing heap usage over time, including information about which parts of the program are responsible for the most memory allocations. The graph is supplemented by a text or HTML file that includes more information for determining where the most memory is being allocated. Massif runs programs about 20x slower than normal.
Read the valgrind documentation for more info. This is the way you invoke the tool:
valgrind --tool=massif <executable> <arguments>
The executable in this case is CastaliaBin (not the Castalia python script, which is a higher level execution tool).
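If the profiling run needs to be scripted, the command line can be assembled in Python. This is only a sketch: the helper name is made up, the arguments passed to CastaliaBin (`-f omnetpp.ini`) are an assumption about how the simulation is normally started, and `--massif-out-file` is used so the output lands in a known file.

```python
import shlex


def massif_command(executable, args, out_file="massif.out"):
    """Build the argv list for running an executable under valgrind's massif."""
    return [
        "valgrind",
        "--tool=massif",
        f"--massif-out-file={out_file}",
        executable,
        *args,
    ]


if __name__ == "__main__":
    # Assumed invocation; adjust the arguments to match your own setup.
    cmd = massif_command("./CastaliaBin", ["-f", "omnetpp.ini"])
    print(shlex.join(cmd))
```

The list can be passed to `subprocess.run`, and the resulting file can then be summarized with `ms_print massif.out`.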

Memory management when using GPU in TensorFlow

I have some questions about using the GPU in TensorFlow. I was following the convolutional neural network tutorial here (tensorflow/models/image/cifar10/cifar10_train.py). In the tutorial, all parameters (e.g., weights) are stored and updated in CPU memory, and GPUs are only used to compute gradients or run inference.
Since the weights are stored on the CPU, they have to be synchronized every iteration, and the GPU seems to be underutilized (about 60% according to nvidia-smi). When using multiple GPUs, I understand that the weights should be stored in CPU memory to synchronize between the GPUs. However, why does this tutorial store all weights on the CPU even with a single GPU? Is there any way to store and update them in GPU memory?
For inference, are the weights copied to the GPU once and then reused, or must they be copied every time they are used?
How about image data? It seems that this data resides on the GPU (not sure). When is it transferred to the GPU: when it is loaded from disk, or when it is required on the GPU?
If it is copied to the GPU right after it is loaded from disk, what happens if the image data is too large to fit in GPU memory? In that case, is there any way to copy the data in pieces (something like prefetching)?
If it is copied to the GPU on demand, is there any way to prefetch it before it is actually used by the GPU, to avoid idle time?
EDIT: It would be helpful if there is any way to check where the send/recv nodes are inserted between CPU and GPU (as in the white paper).
Those tutorials are meant to show off the API, so they don't optimize for performance. It's faster to keep variables on the GPU for a single-tower model, and also faster for a multi-tower model when you have p2p communication enabled between GPUs. To pin variables to the GPU, use the same tf.device('/gpu:0') approach as for any other op.
You can see all the memory copies between GPUs if you enable partition graphs, i.e. do something like this:
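import tensorflow as tf  # TF 1.x API
from tensorflow.python.client.timeline import Timeline

# sess is an existing tf.Session and x is the tensor/op to profile.
metadata = tf.RunMetadata()
sess.run(x,
         options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE,
                               output_partition_graphs=True),
         run_metadata=metadata)

# Chrome trace of the step (can be opened in chrome://tracing).
timeline = Timeline(metadata.step_stats)
with open("dynamic_stitch_gpu_profile.json", "w") as f:
    f.write(timeline.generate_chrome_trace_format())
# Full run metadata, including the partition graphs.
with open("dynamic_stitch_gpu_profile.pbtxt", "w") as f:
    f.write(str(metadata))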
See this issue for an example of using this technique to track down copies:
https://github.com/tensorflow/tensorflow/issues/7251#issuecomment-277385212
For prefetching to GPU, see this issue
There are new stage_op ops that have been added that allow prefetching to GPU and are dramatically faster than using Python queue runner approach. They are in process of being documented.
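The idea behind prefetching, independent of TensorFlow, is to let a producer stay a few items ahead of the consumer so that loading overlaps with compute. The following is a generic sketch using a background thread and a bounded queue. It is not the stage_op API mentioned above, and the `prefetch` helper is a made-up name.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(iterable, buffer_size=2):
    """Yield items from iterable while a background thread loads ahead."""
    q = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for item in iterable:
                q.put(item)  # blocks when the buffer is full
        finally:
            # Always signal the end, even if the iterable raised,
            # so the consumer does not wait forever.
            q.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item


if __name__ == "__main__":
    for batch in prefetch(range(5)):
        print(batch)
```

A bounded buffer keeps memory use fixed: the producer is at most `buffer_size` items ahead, which is the same trade-off a GPU staging area makes.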

Software memory bit-flip detection for platforms without ECC

Most affordable desktop x86 platforms still have no ECC (Error Checking & Correction) memory support. But the rate of memory bit-flip errors is still growing (not the best SO thread; the large-scale CERN 2007 study "Data integrity": "Bit Error Rate of 10^-12 for their memory modules ... observed error rate is 4 orders of magnitude lower than expected"; Google's 2009 "DRAM Errors in the Wild: A Large-Scale Field Study"). For current hardware under a data-intensive load (8 GB/s of reads), this means that a single bit flip may occur every minute (the vendors' BER of 10^-12 from CERN07) or once in two days (BER of 10^-16 from CERN07). Google09 says that there can be up to 25000-75000 one-bit FIT (failures in time per billion hours) per Mbit, which is equal to 1-5 bit errors per hour for 8 GB of RAM ("mean correctable error rates of 2000–6000 per GB per year").
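The arithmetic behind these estimates can be reproduced in a few lines. This sketch assumes 8 GB/s means 8×10^9 bytes per second and 8 GB of RAM means 8 × 1024 × 8 Mbit; the function names are made up. Under those assumptions the 10^-12 case works out to roughly 16 seconds between flips, so "every minute" should be read as an order-of-magnitude figure.

```python
def seconds_between_errors(bytes_per_second, bit_error_rate):
    """Mean time between bit errors for a given read rate and BER."""
    bits_per_second = bytes_per_second * 8
    return 1.0 / (bits_per_second * bit_error_rate)


def errors_per_hour_from_fit(fit_per_mbit, capacity_mbit):
    """FIT is failures per 10^9 hours, so scale by capacity and divide."""
    return fit_per_mbit * capacity_mbit / 1e9


if __name__ == "__main__":
    read_rate = 8e9  # 8 GB/s, taken as 8 * 10^9 bytes per second
    for ber in (1e-12, 1e-16):
        print("BER", ber, "->", seconds_between_errors(read_rate, ber), "s")

    mbit_in_8gb = 8 * 1024 * 8  # 8 GB expressed in Mbit
    for fit in (25000, 75000):
        print("FIT", fit, "->", errors_per_hour_from_fit(fit, mbit_in_8gb), "errors/hour")
```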
So, I want to know: is it possible to add some kind of software error detection in a system-wide manner (checking both user and kernel memory)? For example, could one create a patch for the Linux kernel and/or the system compiler that adds a checksum for every memory page, and try to detect silent memory corruption (bit flips) by regularly recomputing the checksums?
For example, can we see all writes to memory (both from user and kernel space), to distinguish intended memory changes from in-memory bit flips? Or can we somehow instrument all code with some helper?
I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit flips early, before they are reused in later computations or stored to the hard drive.
I also understand that the better way to protect data from memory bit flips is to switch to ECC hardware, but most PCs out there are still non-ECC.
The thing is, ECC is dirt cheap compared to "software ECC countermeasures". You can easily detect if they have ECC modules and complain (or print a warning) when they don't.
http://www.cyberciti.biz/faq/ecc-memory-modules/
For example, can we see all writes to memory (both from user and kernel space), to distinguish between intended memory changes from in-memory bit flips? Or can we somehow instrument all codes with some helper?
Er, you will never "see" the bit-flips on the bus. They are literally caused by a particle hitting RAM, flipping a bit. Only much later can you notice that you read out something different from what you wrote in. To detect this only via the bus, you would need a duplicate copy of all your RAM (i.e. create a shadow copy of what is in your real RAM, so you can verify that every read returns what was written to that location).
try to detect silent memory corruptions (bit-flips) by regular recomputing of checksums?
The Redis guy has a nice write-up on an algorithm for testing RAM for problems. http://antirez.com/news/43 But this is really looking for RAM errors, not random bit-flips.
"Recomputing checksums" only works when you are NOT writing to the memory. That might be "good enough", but you'll need to figure out which pages are not being written to.
To catch 100% of the errors, every write must be preceded by computing the checksum of that block of memory and comparing it to the recorded checksum (to make sure that block hasn't degraded in RAM). Only then is it safe to do the write and update the checksum. As you can imagine, the performance of this will be horrible (at least 100x slower).
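The verify-before-write scheme just described can be sketched in user space as follows. This is a toy model over a bytearray, not a kernel mechanism: the class name, the 4096-byte page size and the use of CRC32 are all choices made for the example.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for the example


class ChecksummedMemory:
    """Toy memory region that keeps one CRC32 per page."""

    def __init__(self, size):
        self.data = bytearray(size)
        pages = (size + PAGE_SIZE - 1) // PAGE_SIZE
        self.sums = [self._crc(p) for p in range(pages)]

    def _crc(self, page):
        start = page * PAGE_SIZE
        return zlib.crc32(self.data[start:start + PAGE_SIZE])

    def verify(self, page):
        return self._crc(page) == self.sums[page]

    def write(self, offset, payload):
        if offset < 0 or offset + len(payload) > len(self.data):
            raise ValueError("write outside the region")
        if not payload:
            return
        first = offset // PAGE_SIZE
        last = (offset + len(payload) - 1) // PAGE_SIZE
        # Check every touched page BEFORE writing, as described above.
        for page in range(first, last + 1):
            if not self.verify(page):
                raise RuntimeError(f"page {page} corrupted before write")
        self.data[offset:offset + len(payload)] = payload
        for page in range(first, last + 1):
            self.sums[page] = self._crc(page)

    def scrub(self):
        """Return the pages whose contents no longer match their checksum."""
        return [p for p in range(len(self.sums)) if not self.verify(p)]


if __name__ == "__main__":
    mem = ChecksummedMemory(2 * PAGE_SIZE)
    mem.write(10, b"hello")
    mem.data[5000] ^= 0x01  # simulate a bit flip that bypasses write()
    print(mem.scrub())
```

Every write rechecksums whole pages, which makes the cost of the scheme easy to see even in this toy form.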
I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit-flips early, before they will be reused in later computations or stored to hard drive.
Well, there is a simple method to detect 100% of the errors, at a cost of 50% performance: Just run the computation on 2 boxes at once (or on one box at two different times, maybe with a RAM test in between if you are paranoid.) If the results differ, you have detected an error.
See also:
https://www.linuxquestions.org/questions/linux-hardware-18/how-to-detect-ecc-memory-errors-under-linux-886011/
The answer to the question is yes, and a proof for that is the software SoftECC posted in the comments!
Just a note that SoftECC is a kernel-level solution. A user-land app on top of it would be a third stage of redundancy, which does not seem necessary.
