Does pytorch support repeating a tensor without allocating significantly more memory?
Assume we have a tensor
t = torch.ones((1,1000,1000))
t10 = t.repeat(10,1,1)
Repeating t 10 times will require take 10x the memory. Is there a way how I can create a tensor t10 without allocating significantly more memory?
Here is a related question, but without answers.
You can use torch.expand
t = torch.ones((1, 1000, 1000))
t10 = t.expand(10, 1000, 1000)
Keep in mind that the t10 is just a reference to t. So for example, a change to t10[0,0,0] will result in the same change in t[0,0,0] and every member of t10[:,0,0].
Other than direct access, most operations performed on t10 will cause memory to be copied which will break the reference and cause more memory to be used. For example: changing the device (.cpu(), .to(device=...), .cuda()), changing the datatype (.float(), .long(), .to(dtype=...)), or using .contiguous().
Related
I am writing an algorithm which all blocks are reading a same address. Such as we have a list=[1, 2, 3, 4], and all blocks are reading it and store it to their own shared memory...My test shows the more blocks reading it, the slower it will be...I guess no broadcast happen here? Any idea I can make it faster? Thank you!!!
I learnt from previous post that this can be broadcast in one wrap, seems can not happen in different wrap....(Actually in my case, the threads in one wrap are not reading a same location...)
Once list element is accessed by first warp of a SM unit, the second warp in same SM unit gets it from cache and broadcasts to all simt lanes. But another SM unit's warp may not have it in L1 cache so it fetches from L2 to L1 first.
It is similar in __constant__ memory but it requires same address to be accessed by all threads. Its latency is closer to register access. __constant__ memory is like instruction cache, you get more performance when all threads do same thing.
For example, if you have a Gaussian-filter that iterates over same coefficient-list of filter on all threads, it is better to use constant memory. Using shared memory does not have much advantage as the filter array is not scanned randomly. Shared memory is better when the filter array content is different per block or if it needs random access.
You can also combine constant memory and shared memory. Get half of list from constant memory, then the other half from shared memory. This should let 1024 threads hide latency of one memory type hidden behind the other.
If list is small enough, you can use registers directly (has to be compile-time known indices). But it increases register pressure and may decrease occupancy so be careful about this.
Some old cuda architectures (in case of fma operation) required one operand fetched from constant memory and the other operand from a register to achieve better performance in compute-bottlenecked algorithms.
In a test with 12000 floats as filter to be applied on all threads inputs, shared memory version with 128 threads-per-block completed work in 330 milliseconds while constant-memory version completed in 260 milliseconds and the L1 access performance was the real bottleneck in both versions so the real constant-memory performance is even better, as long as it is similar-index for all threads.
I'm looking to allocate a vector of small-sized structs.
This takes 30 milliseconds and increases linearly:
let v = vec![[0, 0, 0, 0]; 1024 * 1024];
This takes tens of microseconds:
let v = vec![0; 1024 * 1024];
Is there a more efficient solution to the first case? I'm okay with unsafe code.
Fang Zhang's answer is correct in the general case. The code you asked about is a little bit special: it could use alloc_zeroed, but it does not. As Stargateur also points out in the question comments, with future language and library improvements it is possible both cases could take advantage of this speedup.
This usually should not be a problem. Initializing a whole big vector at once probably isn't something you do extremely often. Big allocations are usually long-lived, so you won't be creating and freeing them in a tight loop -- the cost of initializing the vector will only be paid rarely. Sooner than resorting to unsafe, I would take a look at my algorithms and try to understand why a single memset is causing so much trouble.
However, if you happen to know that all-bits-zero is an acceptable initial value, and if you absolutely cannot tolerate the slowdown, you can do an end-run around the standard library by calling alloc_zeroed and creating the Vec using from_raw_parts. Vec::from_raw_parts is unsafe, so you have to be absolutely sure the size and alignment of the allocated memory is correct. Since Rust 1.44, you can use Layout::array to do this easily. Here's an example:
pub fn make_vec() -> Vec<[i8; 4]> {
let layout = std::alloc::Layout::array::<[i8; 4]>(1_000_000).unwrap();
// I copied the following unsafe code from Stack Overflow without understanding
// it. I was advised not to do this, but I didn't listen. It's my fault.
unsafe {
Vec::from_raw_parts(
std::alloc::alloc_zeroed(layout) as *mut _,
1_000_000,
1_000_000,
)
}
}
See also
How to perform efficient vector initialization in Rust?
vec![0; 1024 * 1024] is a special case. If you change it to vec![1; 1024 * 1024], you will see performance degrades dramatically.
Typically, for non-zero element e, vec![e; n] will clone the element n times, which is the major cost. For element equal to 0, there is other system approach to init the memory, which is much faster.
So the answer to your question is no.
I'm trying to regress a model with group-time fixed effects, and many dummies.
egen id_t = concat(id year), format(%15.0f)
areg y u2j* j2j* d1* d2* x1 x2, absorb(id_t) vce(r)
d1 and d2 are dummies, for each there is hundreds of possible values. u2j* is the interaction of one variable, u2j, with time dummies:
forvalues t=1980/2000 {
gen y_`t' = (year==`t')
gen u2jXy`t' = y_`t'*u2j
(...)
I'm running into a memory error trying to do this. All my dummies have size int, and all other variables are as small as they can be. What else can I try to resolve the memory issue?
The memory error, as I remember it, was
You tried to allocate 8xxxxm of memory (256m through...), but your system administrator has set max memory to 80g. See help memory
if you have too many dummy variables, you should look for alternative estimation techniques that are specifically designed for that.
I would suggest reghdfe by sergio correa. look here
https://github.com/sergiocorreia/reghdfe
Its very efficient for high dimentional fixed effect problems. You should not run into any memory problems anymore.
I've been doing some computationally intensive work in F#. Functions like Array.Parallel.map which use the .Net Task Parallel Library have sped up my code exponentially for a really quite minimal effort.
However, due to memory concerns, I remade a section of my code so that it can be lazily evaluated inside a sequence expression (this means I have to store and pass less information). When it came time to evaluate I used:
// processor and memory intensive task, results are not stored
let calculations : seq<Calculation> = seq { ...yield one thing at a time... }
// extract results from calculations for summary data
PSeq.iter someFuncToExtractResults results
Instead of:
// processor and memory intensive task, storing these results is an unnecessary task
let calculations : Calculation[] = ...do all the things...
// extract results from calculations for summary data
Array.Parallel.map someFuncToExtractResults calculations
When using any of the Array.Parallel functions I can clearly see all the cores on my computer kick into gear (~100% CPU usage). However the extra memory required means the program never finished.
With the PSeq.iter version when I run the program, there's only about 8% CPU usage (and minimal RAM usage).
So: Is there some reason why the PSeq version runs so much slower? Is it because of the lazy evaluation? Is there some magic "be parallel" stuff I am missing?
Thanks,
Other resources, source code implementations of both (they seem to use different Parallel libraries in .NET):
https://github.com/fsharp/fsharp/blob/master/src/fsharp/FSharp.Core/array.fs
https://github.com/fsharp/powerpack/blob/master/src/FSharp.PowerPack.Parallel.Seq/pseq.fs
EDIT: Added more detail to code examples and details
Code:
Seq
// processor and memory intensive task, results are not stored
let calculations : seq<Calculation> =
seq {
for index in 0..data.length-1 do
yield calculationFunc data.[index]
}
// extract results from calculations for summary data (different module)
PSeq.iter someFuncToExtractResults results
Array
// processor and memory intensive task, storing these results is an unnecessary task
let calculations : Calculation[] =
Array.Parallel.map calculationFunc data
// extract results from calculations for summary data (different module)
Array.Parallel.map someFuncToExtractResults calculations
Details:
The storing the intermediate array version runs quick (as far as it gets before crash) in under 10 minutes but uses ~70GB RAM before it crashes (64GB physical, the rest paged)
The seq version takes over 34mins and uses a fraction of the RAM (only around 30GB)
There's a ~billion values I'm calculating. Hence a billion doubles (at 64bits each) = 7.4505806GB. There's more complex forms of data... and a few unnecessary copies I'm cleaning up hence the current massive RAM usage.
Yes the architecture isn't great, the lazy evaluation is the first part of me attempting to optimize the program and/or batch up the data into smaller chunks
With a smaller dataset, both chunks of code output the same results.
#pad, I tried what you suggested, the PSeq.iter seemed to work properly (all cores active) when fed the Calculation[], but there is still the matter of RAM (it eventually crashed)
both the summary part of the code and the calculation part are CPU intensive (mainly because of large data sets)
With the Seq version I just aim to parallelize once
Based on your updated information, I'm shortening my answer to just the relevant part. You just need this instead of what you currently have:
let result = data |> PSeq.map (calculationFunc >> someFuncToExtractResults)
And this will work the same whether you use PSeq.map or Array.Parallel.map.
However, your real problem is not going to be solved. This problem can be stated as: when the desired degree of parallel work is reached in order to get to 100% CPU usage, there is not enough memory to support the processes.
Can you see how this will not be solved? You can either process things sequentially (less CPU efficient, but memory efficient) or you can process things in parallel (more CPU efficient, but runs out of memory).
The options then are:
Change the degree of parallelism to be used by these functions to something that won't blow your memory:
let result = data
|> PSeq.withDegreeOfParallelism 2
|> PSeq.map (calculationFunc >> someFuncToExtractResults)
Change the underlying logic for calculationFunc >> someFuncToExtractResults so that it is a single function that is more efficient and streams data through to results. Without knowing more detail, it's not simple to see how this could be done. But internally, certainly some lazy loading may be possible.
Array.Parallel.map uses Parallel.For under the hood while PSeq is a thin wrapper around PLINQ. But the reason they behave differently here is there is not enough workloads for PSeq.iter when seq<Calculation> is sequential and too slow in yielding new results.
I do not get the idea of using intermediate seq or array. Suppose data to be the input array, moving all calculations in one place is the way to go:
// Should use PSeq.map to match with Array.Parallel.map
PSeq.map (calculationFunc >> someFuncToExtractResults) data
and
Array.Parallel.map (calculationFunc >> someFuncToExtractResults) data
You avoid consuming too much memory and have intensive computation in one place which leads to better efficiency in parallel execution.
I had a problem similar to yours and solved it by adding the following to the solution's App.config file:
<runtime>
<gcServer enabled="true" />
<gcConcurrent enabled="true"/>
</runtime>
A calculation that was taking 5'49'' and showing roughly 22% CPU utilization on Process Lasso took 1'36'' showing roughly 80% CPU utilization.
Another factor that may influence the speed of parallelized code is whether hyperthreading (Intel) or SMT (AMD) is enabled in the BIOS. I have seen cases where disabling leads to faster execution.
I have an algorithm where I create two bi-dimensional arrays like this:
TYPE
TPtrMatrixLine = array of byte;
TCurMatrixLine = array of integer;
TPtrMatrix = array of TPtrMatrixLine;
TCurMatrix = array of TCurMatrixLine;
function x
var
PtrsMX: TPtrMatrix;
CurMx : TCurMatrix;
begin
{ Try to allocate RAM }
SetLength(PtrsMX, RowNr+1, ColNr+1);
SetLength(CurMx , RowNr+1, ColNr+1);
for all rows do
for all cols do
FillMatrixWithData; <------- CPU intensive task. It could take up to 10-20 min
end;
The two matrices have always the same dimension.
Usually there are only 2000 lines and 2000 columns in the matrix but sometimes it can go as high as 25000x6000 so for both matrices I need something like 146.5 + 586.2 = 732.8MB of RAM.
The problem is that the two blocks need to be contiguous so in most cases, even if 500-600MB of free RAM doesn't seem much on a modern computer, I run out of RAM.
The algorithm fills the cells of the array with data based on the neighbors of that cell. The operations are just additions and subtractions.
The TCurMatrixLine is the one that takes a lot or RAM since it uses integers to store data. Unfortunately, values stored may have sign so I cannot use Word instead of integers. SmallInt is too small (my values are bigger than SmallInt, but smaller than Word). I hope that if there is any other way to implement this, it needs not to add a lot of overhead, since processing a matrix with so many lines/column already takes a lot of time. In other words I hope that decreasing memory requirements will not increase processing time.
Any idea how to decrease the memory requirements?
[I use Delphi 7]
Update
Somebody suggested that each row of my array should be an independent uni-dimensional array.
I create as many rows (arrays) as I need and store them in TList. Sound very good. Obviously there will be no problem allocation such small memory blocks. But I am afraid it will have a gigantic impact on speed. I use now
TCurMatrixLine = array of integer;
TCurMatrix = array of TCurMatrixLine;
because it is faster than TCurMatrix= array of array of integer (because of the way data is placed in memory). So, breaking the array in independent lines may affect the speed.
The suggestion of using a signed 2 byte integer will greatly aid you.
Another useful tactic is to mark your exe as being LARGE_ADDRESS_AWARE by adding {$SetPEFlags IMAGE_FILE_LARGE_ADDRESS_AWARE} to your .dpr file. This will only help if you are running on 64 bit Windows and will increase your address space from 2GB to 4GB.
It may not work on Delphi 7 (I seem to recall you are using D7) and you must be using FastMM since the old Borland memory manager isn't compatible with large address space. If $SetPEFlags isn't available you can still mark the exe with EDITBIN.
If you still encounter difficulties then yet another trick is to do allocate smaller sub-blocks of memory and use a wrapper class to handle mapping indices to the appropriate sub-block and offset within. You can use a default index property to make this transparent to the calling code.
Naturally a block allocated approach like this does incur some processing overhead but it's your best bet if you are having troubles with getting contiguous blocks.
If the absolute values of elements of CurMx fits word then you can store it in word and use another array of boolean for its sign. It reduces 1 byte for each element.
Have you considered to manually allocate the data structure on the heap?
...and measured how this will affect the memory usage and the performance?
Using the heap might actually increase speed and reduce the memory usage, because you can avoid the whole array to be copied from one memory segment to another memory segment. (Eg. if your FillMatrixWithData are declared with a non-const open array parameter).