I'm trying to regress a model with group-time fixed effects, and many dummies.
egen id_t = concat(id year), format(%15.0f)
areg y u2j* j2j* d1* d2* x1 x2, absorb(id_t) vce(r)
d1 and d2 are dummies; for each there are hundreds of possible values. u2j* is the interaction of one variable, u2j, with time dummies:
forvalues t = 1980/2000 {
    gen y_`t' = (year == `t')
    gen u2jXy`t' = y_`t' * u2j
    (...)
}
I'm running into a memory error trying to do this. All my dummies are stored as int, and all other variables are as small as they can be. What else can I try to resolve the memory issue?
The memory error, as I remember it, was
You tried to allocate 8xxxxm of memory (256m through...), but your system administrator has set max memory to 80g. See help memory
If you have too many dummy variables, you should look for alternative estimation techniques that are specifically designed for that.
I would suggest reghdfe by Sergio Correia; look here:
https://github.com/sergiocorreia/reghdfe
It's very efficient for high-dimensional fixed-effects problems. You should not run into any memory problems anymore.
I'm looking to allocate a vector of small-sized structs.
This takes 30 milliseconds and increases linearly:
let v = vec![[0, 0, 0, 0]; 1024 * 1024];
This takes tens of microseconds:
let v = vec![0; 1024 * 1024];
Is there a more efficient solution to the first case? I'm okay with unsafe code.
Fang Zhang's answer is correct in the general case. The code you asked about is a little bit special: it could use alloc_zeroed, but it does not. As Stargateur also points out in the question comments, with future language and library improvements it is possible both cases could take advantage of this speedup.
This usually should not be a problem. Initializing a whole big vector at once probably isn't something you do extremely often. Big allocations are usually long-lived, so you won't be creating and freeing them in a tight loop -- the cost of initializing the vector will only be paid rarely. Sooner than resorting to unsafe, I would take a look at my algorithms and try to understand why a single memset is causing so much trouble.
However, if you happen to know that all-bits-zero is an acceptable initial value, and if you absolutely cannot tolerate the slowdown, you can do an end-run around the standard library by calling alloc_zeroed and creating the Vec using from_raw_parts. Vec::from_raw_parts is unsafe, so you have to be absolutely sure the size and alignment of the allocated memory is correct. Since Rust 1.44, you can use Layout::array to do this easily. Here's an example:
pub fn make_vec() -> Vec<[i8; 4]> {
    let layout = std::alloc::Layout::array::<[i8; 4]>(1_000_000).unwrap();
    // I copied the following unsafe code from Stack Overflow without understanding
    // it. I was advised not to do this, but I didn't listen. It's my fault.
    unsafe {
        Vec::from_raw_parts(
            std::alloc::alloc_zeroed(layout) as *mut _,
            1_000_000,
            1_000_000,
        )
    }
}
See also
How to perform efficient vector initialization in Rust?
vec![0; 1024 * 1024] is a special case. If you change it to vec![1; 1024 * 1024], you will see performance degrade dramatically.
Typically, for a non-zero element e, vec![e; n] will clone the element n times, which is the major cost. For an element equal to 0, the allocator can obtain pre-zeroed memory from the operating system, which is much faster.
So the answer to your question is no.
We are taught that the abstraction of RAM is a long array of bytes, and that for the CPU it takes the same amount of time to access any part of it. What is the device that has the ability to access any byte out of the 4 gigabytes (on my computer) in the same amount of time? This does not seem like a trivial task to me.
I have asked colleagues and my professors, but nobody can pinpoint how this task can be achieved with simple logic gates, and if it isn't just a tricky combination of logic gates, then what is it?
My personal guess is that you could achieve memory access in O(log(n)) time, where n is the size of the memory, because each gate would split the memory in two and send the memory-access instruction on to the next split-the-memory-in-two gate. But that requires a LOT of gates. I can't come up with any other educated guess, and I don't even know the name of the device that I should look up on Google.
Please help my anguished curiosity, and thanks in advance.
Edit: this is what I learned!
To quote one of the answers: "the RAM can send the value from cell addressed X to some output pins". Here is where everyone skips (again) the thing that is not trivial for me. The way I see it, in order to build a gate that decides from 64 pins which byte out of 2^64 to get, each pin needs to split the overall possible range of memory in two: if the bit at index 0 is 0, the address is in the range 0 to 2^64/2, else the address is in the range 2^64/2 to 2^64, and so on. The number of gates (let's call them that) the memory fetch will go through is 64, a constant. However, the number of gates needed is N, where N is the number of memory bytes.
Just because there are 64 pins, it doesn't mean you can decode them into a single fetch from a range of 2^64. Does 4 gigabytes of memory come with 4 gigabytes of gates in the memory controller???
Now this can be improved, because as I read furiously more and more about how this memory is architected, if you place the memory into a matrix with sqrt(N) rows and sqrt(N) columns, the number of gates a memory fetch will need to go through is O(2*log(sqrt(N))) and the number of gates required is 2*sqrt(N), which is much better. I think it's probably a trade secret.
What the heck, I might as well make this an answer.
Yes, in the physical world, memory access cannot be constant time.
But it cannot even be logarithmic time. The O(log n) circuit you have in mind ultimately involves some sort of binary (or whatever) tree, and you cannot make a binary tree with constant-length wires in a 3D universe.
Whatever the "bits per unit volume" capacity of your technology is, storing n bits requires a sphere with radius O(n^(1/3)). Since information can only travel at the speed of light, accessing a bit at the other end of the sphere requires time O(n^(1/3)).
But even this is wrong. If you want to talk about actual limitations of our universe, our physics friends say the absolute maximum number of bits you can store in any sphere is proportional to the sphere's surface area, not its volume. So the actual radius of a minimal sphere containing n bits of information is O(sqrt(n)).
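Spelled out, with r the radius of the sphere and c the speed of light (same assumptions as above):

    n bits ∝ r^3  =>  r = O(n^(1/3))  =>  access time ~ r/c = O(n^(1/3))

and under the surface-area bound:

    n bits ∝ r^2  =>  r = O(sqrt(n))  =>  access time = O(sqrt(n))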
As I mentioned in my comment, all of this is pretty much moot. The models of computation we generally use to analyze algorithms assume constant-access-time RAM, which is close enough to the truth in practice and allows a fair comparison of competing algorithms. (Although good engineers working on high-performance code are very concerned about locality and the memory hierarchy...)
Let's say your RAM has 2^64 cells (places where it is possible to store a single value, say 8 bits). Then it needs 64 pins to address every cell with a different number. When a binary number X 'appears' at the input pins of your RAM, the RAM can send the value from the cell addressed X to some output pins, and your CPU can get the value from there. In hardware the addressing can be done quite easily, for example by using multiple NAND gates (such an 'addressing device' built from logic gates is called a decoder).
So it is all happening at the hardware level; this is just direct addressing. If the CPU is able to provide 64 bits to the 64 pins of your RAM, it can address every single memory cell (as 64 bits are enough to represent any number up to 2^64 - 1). The only reason why you do not get the value immediately is a kind of 'propagation time': the time it takes for the signal to go through all the logic gates in the circuit.
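As a toy illustration (a sketch of the idea, not how any real DRAM is wired), here is a 2-bit decoder written out in C. The depth per access is constant, but there is one output gate per cell, which is where the "N gates" intuition from the question is right:

#include <stdio.h>

/* Toy 2-bit address decoder: four output lines, exactly one goes high.
   Each line is one AND of the (possibly inverted) address bits, so the
   depth is constant while the gate count grows with the number of cells. */
void decode2(int a1, int a0, int line[4]) {
    line[0] = !a1 & !a0;   /* selects cell 0 (binary 00) */
    line[1] = !a1 &  a0;   /* selects cell 1 (binary 01) */
    line[2] =  a1 & !a0;   /* selects cell 2 (binary 10) */
    line[3] =  a1 &  a0;   /* selects cell 3 (binary 11) */
}

int main(void) {
    int line[4];
    decode2(1, 0, line);   /* address 2 */
    printf("%d%d%d%d\n", line[0], line[1], line[2], line[3]);   /* prints 0010 */
    return 0;
}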
The component responsible for dealing with memory accesses is the memory controller. It is used by the CPU to read from and write to memory.
The access time is constant because memory words really are laid out in matrix form (thus the "byte array" abstraction is quite realistic), where you have rows and columns. To fetch a given memory position, the desired memory address is passed to the controller, which then activates the right row and column.
From http://computer.howstuffworks.com/ram1.htm:
Memory cells are etched onto a silicon wafer in an array of columns (bitlines) and rows (wordlines). The intersection of a bitline and wordline constitutes the address of the memory cell.
So, basically, the answer to your question is: the memory controller figures it out. Of course, given a memory address, the mapping to row and column must be easy to calculate in constant time.
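As a sketch of that mapping (the constant here is hypothetical, not from any particular chip): with a power-of-two column count, splitting an address into row and column is just a shift and a mask, which takes constant time:

#include <stdio.h>

#define COL_BITS 10   /* hypothetical: 1024 columns per row */

unsigned row_of(unsigned addr) { return addr >> COL_BITS; }
unsigned col_of(unsigned addr) { return addr & ((1u << COL_BITS) - 1); }

int main(void) {
    unsigned addr = 123456;
    printf("addr %u -> row %u, col %u\n", addr, row_of(addr), col_of(addr));
    return 0;
}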
To fully understand this topic, I recommend you read this guide on how memory works: http://computer.howstuffworks.com/ram.htm
There are so many concepts to master that it is difficult to explain it all in one answer.
I was reading your comments and questions up until I answered. I think you are on the right track, but there is some confusion here. The random access you are implying doesn't exist in the same way you think it does.
Reading, writing, and refreshing are done in a continuous cycle. A particular cell in memory is only read or written in a certain interval if a signal to do so is detected in that cycle. There is also support circuitry, including "sense amplifiers to amplify the signal or charge detected on a memory cell."
Unless I am misunderstanding what you are implying, your confusion is about how easy it is to read from or write to a cell. It differs between chip designs, but there IS a minimum number of cycles it takes to read or write data to a cell.
These are my sources:
http://www.doc.ic.ac.uk/~dfg/hardware/HardwareLecture16.pdf
http://www.electronics.dit.ie/staff/tscarff/memory/dram_cycles.htm
http://www.ece.cmu.edu/~ece548/localcpy/dramop.pdf
To avoid a humongous answer, I left most of the detail out, but all three of these will describe the process you are looking for.
I have an algorithm where I create two bi-dimensional arrays like this:
TYPE
  TPtrMatrixLine = array of byte;
  TCurMatrixLine = array of integer;
  TPtrMatrix     = array of TPtrMatrixLine;
  TCurMatrix     = array of TCurMatrixLine;

function x;
var
  PtrsMX: TPtrMatrix;
  CurMx : TCurMatrix;
begin
  { Try to allocate RAM }
  SetLength(PtrsMX, RowNr+1, ColNr+1);
  SetLength(CurMx , RowNr+1, ColNr+1);
  for all rows do
    for all cols do
      FillMatrixWithData;   { <------- CPU-intensive task; it could take up to 10-20 min }
end;
The two matrices have always the same dimension.
Usually there are only 2000 lines and 2000 columns in the matrix, but sometimes it can go as high as 25000x6000, so for the two matrices together I need something like 146.5 + 586.2 = 732.7MB of RAM.
The problem is that the two blocks each need to be contiguous, so in most cases, even if 500-600MB of free RAM doesn't seem like much on a modern computer, I run out of RAM.
The algorithm fills the cells of the array with data based on the neighbors of that cell. The operations are just additions and subtractions.
The TCurMatrixLine is the one that takes a lot of RAM, since it uses integers to store data. Unfortunately, the stored values may be negative, so I cannot use Word instead of Integer. SmallInt is too small (my values are bigger than SmallInt allows, but smaller than Word). I hope that any alternative implementation will not add a lot of overhead, since processing a matrix with so many lines/columns already takes a lot of time. In other words, I hope that decreasing the memory requirements will not increase the processing time.
Any idea how to decrease the memory requirements?
[I use Delphi 7]
Update
Somebody suggested that each row of my array should be an independent uni-dimensional array.
I would create as many rows (arrays) as I need and store them in a TList. Sounds very good. Obviously there will be no problem allocating such small memory blocks. But I am afraid it will have a gigantic impact on speed. Right now I use:
TCurMatrixLine = array of integer;
TCurMatrix = array of TCurMatrixLine;
because it is faster than TCurMatrix = array of array of integer (because of the way the data is placed in memory). So, breaking the array into independent lines may affect the speed.
The suggestion of using a signed 2-byte integer will greatly aid you.
Another useful tactic is to mark your EXE as LARGE_ADDRESS_AWARE by adding {$SetPEFlags IMAGE_FILE_LARGE_ADDRESS_AWARE} to your .dpr file. This will only help if you are running on 64-bit Windows, and it increases your address space from 2GB to 4GB.
It may not work on Delphi 7 (I seem to recall you are using D7), and you must be using FastMM, since the old Borland memory manager isn't compatible with large address spaces. If $SetPEFlags isn't available, you can still mark the EXE with EDITBIN.
If you still encounter difficulties, then yet another trick is to allocate smaller sub-blocks of memory and use a wrapper class to handle mapping indices to the appropriate sub-block and offset within it. You can use a default indexed property to make this transparent to the calling code; a rough sketch follows below.
Naturally, a block-allocated approach like this incurs some processing overhead, but it's your best bet if you are having trouble getting contiguous blocks.
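For illustration, here is a rough sketch in C of that sub-block idea (in Delphi you would wrap the same mapping in a class with a default indexed property). All names and the block size are made up, and error handling is omitted:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical chunked matrix: cells live in fixed-size sub-blocks, so no
   single allocation ever has to be contiguous. */
#define BLOCK_CELLS (1 << 16)   /* 64K cells (256 KB of ints) per sub-block */

typedef struct {
    size_t cols, nblocks;
    int  **blocks;              /* independently allocated sub-blocks */
} ChunkedMatrix;

ChunkedMatrix *cm_create(size_t rows, size_t cols) {
    ChunkedMatrix *m = malloc(sizeof *m);
    size_t cells = rows * cols;
    m->cols = cols;
    m->nblocks = (cells + BLOCK_CELLS - 1) / BLOCK_CELLS;
    m->blocks = malloc(m->nblocks * sizeof *m->blocks);
    for (size_t b = 0; b < m->nblocks; b++)
        m->blocks[b] = calloc(BLOCK_CELLS, sizeof(int));
    return m;
}

/* Flatten (row, col) to a linear index, then split into block + offset. */
int *cm_cell(ChunkedMatrix *m, size_t r, size_t c) {
    size_t idx = r * m->cols + c;
    return &m->blocks[idx / BLOCK_CELLS][idx % BLOCK_CELLS];
}

int main(void) {
    ChunkedMatrix *m = cm_create(2000, 2000);
    *cm_cell(m, 1999, 1999) = 42;
    printf("%d\n", *cm_cell(m, 1999, 1999));
    return 0;
}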
If the absolute values of the elements of CurMx fit in a Word, then you can store the magnitudes in a Word array and use another array of Boolean for the signs. That costs 3 bytes per element instead of 4, saving 1 byte for each element.
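A minimal sketch of the idea in C, assuming every value fits in -65535..65535 (in Delphi the arrays would be Word and Boolean; the names here are made up):

#include <stdio.h>

#define ROWS 2000
#define COLS 2000

/* Magnitude in 16 bits plus a sign flag: 3 bytes per cell instead of 4. */
static unsigned short mag[ROWS][COLS];
static unsigned char  neg[ROWS][COLS];   /* 1 = negative; could be bit-packed */

int get_cell(int r, int c) {
    return neg[r][c] ? -(int)mag[r][c] : (int)mag[r][c];
}

void set_cell(int r, int c, int v) {
    neg[r][c] = v < 0;
    mag[r][c] = (unsigned short)(v < 0 ? -v : v);
}

int main(void) {
    set_cell(5, 7, -40000);
    printf("%d\n", get_cell(5, 7));   /* prints -40000 */
    return 0;
}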
Have you considered manually allocating the data structure on the heap?
...and measuring how this affects the memory usage and the performance?
Using the heap might actually increase speed and reduce memory usage, because you can avoid the whole array being copied from one memory segment to another (e.g. if your FillMatrixWithData is declared with a non-const open array parameter).
I was asked this question in an interview. Please tell me the answer:
You have no documentation for the kernel. You only know that your kernel supports paging.
How will you find the page size? There is no flag or macro available that can tell you the page size.
I was given the hint that you can use timing to get the answer. I still have no clue how.
Run code like the following (a rough POSIX C version of the idea; the constants and the power-of-two stride schedule are only illustrative):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define CHUNKS (1 << 14)          /* strided writes per test ("veryverybigsize") */
#define MAX_PAGESIZE (1 << 16)    /* largest page size we probe for */
int main(void) {
    for (size_t stride = 64; stride <= MAX_PAGESIZE; stride *= 2) {
        char *mem = malloc(CHUNKS * stride);    /* fresh, untouched pages */
        if (mem == NULL) return 1;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (char *pos = mem; pos < mem + (size_t)CHUNKS * stride; pos += stride)
            *pos = 'Q';           /* write something to force the page into physical memory */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("stride %zu, runtime %ld us\n", stride,
               (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_nsec - t0.tv_nsec) / 1000);
        free(mem);
    }
    return 0;
}
Graph the results with stride on the X axis and runtime on the Y axis. There should be a knee at stride = pagesize, beyond which the performance no longer degrades.
This works by incurring a number of page faults. Once stride surpasses pagesize, the number of pages touched (and hence the number of faults) ceases to increase, so the program's runtime stops growing noticeably.
If you want to be cleverer, you could exploit the fact that the mprotect system call must work on whole pages: try it on something smaller or misaligned and you'll get an error. I'm sure there are other "holes" like that too, but the code above will work on any system that supports paging and where touching a fresh page is much more expensive than touching an already-mapped one. That would be every semi-normal modern system.
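For example, here is a rough Linux/POSIX sketch of that mprotect probe, assuming the page size is a power of two no larger than 2 MiB. mmap returns a page-aligned address, and mprotect fails with EINVAL when its start address is misaligned, so the smallest nonzero offset that succeeds is the page size:

#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1 << 22;        /* 4 MiB, assumed to span several pages */
    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return 1;
    /* Probe increasing power-of-two offsets from the aligned base; the first
       one mprotect accepts must be page-aligned, i.e. equal to the page size. */
    for (size_t off = 1; off <= len / 2; off <<= 1) {
        if (mprotect(base + off, len / 2, PROT_READ) == 0) {
            printf("page size: %zu bytes\n", off);
            return 0;
        }
    }
    return 1;
}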
It looks to me like a question about how paging actually works.
They want you to explain the impact that changing the page size will have on the execution of the system.
I am a bit rusty on this stuff, but when memory is full, the system starts swapping pages out, which slows everything down. So you want to run something that fills up memory to different sizes and measure the time it takes to do a task. At some point there will be a jump, where the time taken to do the task suddenly increases.
Like I said, I am a bit rusty on the implementation details, but I'm pretty sure that is the shape of the answer they were after.
Whatever answer they were expecting, it would almost certainly be a brittle solution. For one thing, you can have multiple page sizes, so any answer you got for one small allocation may be irrelevant for the next multi-megabyte allocation (see things like Linux's huge page support).
I suspect the question was more aimed at seeing how you approached the problem rather than the final solution you came up with.
By the way, this question isn't about Linux, because for Linux you do have documentation, as well as POSIX compliance, for which you just call sysconf(_SC_PAGE_SIZE).
I have just started learning Erlang and am trying out some Project Euler problems to get started. However, I can't seem to do any operations on large sequences without crashing the Erlang shell.
I.e., even this:
lists:seq(1,64000000).
crashes Erlang, with the error:
eheap_alloc: Cannot allocate 467078560 bytes of memory (of type "heap").
The actual number of bytes varies, of course.
Now half a gig is a lot of memory, but a system with 4 gigs of RAM and plenty of space for virtual memory should be able to handle it.
Is there a way to let erlang use more memory?
Your OS may have a default limit on the size of a user process. On Linux you can change this with ulimit.
You probably want to iterate over these 64000000 numbers without needing them all in memory at once. Lazy lists let you write code similar in style to the list-all-at-once code:
-module(lazy).
-export([seq/2]).

%% Each call returns a thunk; forcing it yields [Head | NextThunk] or [].
seq(M, N) when M =< N ->
    fun() -> [M | seq(M+1, N)] end;
seq(_, _) ->
    fun() -> [] end.
1> Ns = lazy:seq(1, 64000000).
#Fun<lazy.0.26378159>
2> hd(Ns()).
1
3> Ns2 = tl(Ns()).
#Fun<lazy.0.26378159>
4> hd(Ns2()).
2
Possibly a noob answer (I'm a Java dev), but the JVM artificially limits the amount of memory to help detect memory leaks more easily. Perhaps Erlang has similar restrictions in place?
This is a feature. We do not want one process to consume all memory. It's like the fuse box in your house: it is there for the safety of us all.
You have to know Erlang's recovery model to understand why they let the process just die.
Also, both Windows and Linux have limits on the maximum amount of memory an image can occupy.
As I recall, on Linux it is half a gigabyte.
The real question is why these operations aren't being done lazily ;)