I'd like to get the amount of "free memory" per NUMA node.
When dealing with a whole machine, one usually parses /proc/meminfo like free does (the number wanted is MemFree + Buffers + Cached).
There are also the files /sys/devices/system/node/nodeX/meminfo (one per node), which seem to display numbers per NUMA node. Does anybody know how these numbers correlate with the content of /proc/meminfo? My naive assumption would be that summing a given field over all NUMA nodes in the system yields the corresponding number in /proc/meminfo. But so far I have failed to figure out the relationships, especially for the page cache.
The code for proc is in fs/proc/meminfo.c, for the sysfs files it's in drivers/base/node.c. Comparing them might give you some hints.
Note that you'll probably never get the numbers to add up 100%, because you can't atomically read the content of all the files, so the values will change while you're reading them.
There also seems to be an inconsistency in the total RAM reported via both methods. One explanation for that is that free_init_mem doesn't appear to be NUMA aware, and increments total_ram_pages but does not do any NUMA accounting.
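As a starting point for comparing the two sources, here is a minimal Python sketch that parses the per-node files and sums each field across nodes. The helper names and the sample text are made up for illustration. Treating MemFree + FilePages as the per-node counterpart of MemFree + Buffers + Cached is an assumption (as far as I can tell, FilePages is the per-node page cache counter and also covers buffers and swap cache), so check it against drivers/base/node.c for your kernel version.

```python
import glob
import re
from collections import defaultdict

# Matches lines such as "Node 0 MemFree:        4096 kB".
_NODE_LINE = re.compile(r"^Node\s+(\d+)\s+([^:]+):\s+(\d+)(?:\s+kB)?\s*$")


def parse_node_meminfo(text):
    """Return {field_name: value} for the contents of one per-node meminfo file."""
    fields = {}
    for line in text.splitlines():
        match = _NODE_LINE.match(line.strip())
        if match:
            fields[match.group(2)] = int(match.group(3))
    return fields


def sum_over_nodes(per_node_texts):
    """Sum every field over all nodes, for comparison with /proc/meminfo."""
    totals = defaultdict(int)
    for text in per_node_texts:
        for name, value in parse_node_meminfo(text).items():
            totals[name] += value
    return dict(totals)


def free_estimate(fields):
    # Assumption: FilePages stands in for Buffers + Cached on a per-node basis.
    return fields.get("MemFree", 0) + fields.get("FilePages", 0)


def read_all_nodes(pattern="/sys/devices/system/node/node[0-9]*/meminfo"):
    """Read every per-node file (Linux only); the reads are not atomic."""
    texts = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            texts.append(handle.read())
    return texts


# Invented sample data in the per-node format, so the sketch runs anywhere.
SAMPLE_NODE0 = "Node 0 MemTotal: 16384 kB\nNode 0 MemFree: 4096 kB\nNode 0 FilePages: 2048 kB\n"
SAMPLE_NODE1 = "Node 1 MemTotal: 16384 kB\nNode 1 MemFree: 1024 kB\nNode 1 FilePages: 512 kB\n"

totals = sum_over_nodes([SAMPLE_NODE0, SAMPLE_NODE1])
per_node_free = [free_estimate(parse_node_meminfo(t)) for t in (SAMPLE_NODE0, SAMPLE_NODE1)]
```

On a real machine you would pass `read_all_nodes()` to `sum_over_nodes()` and compare the totals with /proc/meminfo, keeping in mind that the values move between reads.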
I have 93 arrays. Each array has about 18 values on average.
I need to compute the product of these arrays.
So I have a two-dimensional array that stores these 93 arrays.
Here is what I tried:
DATASET.first.product(*DATASET[1..-1])
Ruby returns
RangeError: too big to product
Does anyone know a workaround for this?
Some way to chunk them?
What you want is impossible.
The product of 93 arrays with ~18 elements each is an array with approximately 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 elements, each of which is a 93-element array.
This means you need 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 * 93 * 64bit of memory to store it, which is roughly 409181424703974032246417423764355843507744515806959861294687507916928986312485983626178977220953161016576988305520852992 bytes. That is about 40 orders of magnitude more than the number of particles in the universe. In other words, even if you were to convert the entire universe into RAM, you would still need to find a way to store on the order of 827180612553027 yobibyte on each and every particle in the universe; that is about 6000000000000000000000000 times the information content of the World Wide Web and 10000000000000000000000 times the information content of the dark web.
Does anyone know a workaround for this? Some way to chunk them?
Even if you process them in chunks, that doesn't change the fact that you still need to process 51147678087996754030802177970544480438468064475869982661835938489616123289060747953272372152619145127072123538190106624 elements. Even if you were able to process one element per CPU instruction (which is unrealistic, you will probably need dozens if not hundreds of instructions), and even if each instruction only takes one clock cycle (which is unrealistic, on current mainstream CPUs, each instruction takes multiple clock cycles), and even if you had a terahertz CPU (which is unrealistic, the fastest current CPUs top out at 5 GHz), and even if your CPU had a million cores (which is unrealistic, even GPUs only have a couple of thousand extremely simple cores), and even if your motherboard had a million sockets (which is unrealistic, mainstream motherboards only have a maximum of 4 sockets, and even the biggest supercomputers only have 10 million cores in total), and even if you had a million of those computers in a cluster, and even if you had a million of those clusters in a supercluster, and even if you had a million friends that also have a supercluster like this, it would still take you about 1621000000000000000000000000000000000000000000000000000000000000000000 years to iterate through them.
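The figures above are easy to re-derive. Here is a rough Python sketch of the arithmetic, assuming exactly 18 elements per array, 8 bytes per stored value, and the deliberately generous machine park described above (one value per cycle at 1 THz, times five factors of a million). All names are made up for illustration.

```python
ARRAYS = 93
ELEMENTS_PER_ARRAY = 18   # assumption: every array has exactly 18 values
BYTES_PER_VALUE = 8       # assumption: 64 bits per stored value

# Number of 93-element combinations in the full product.
combinations = ELEMENTS_PER_ARRAY ** ARRAYS

# Individual values to store or visit, and the memory they would need.
values = combinations * ARRAYS
bytes_needed = values * BYTES_PER_VALUE

# 1 THz, a million cores, sockets, machines, clusters and friends.
values_per_second = 10**12 * (10**6) ** 5
seconds = values // values_per_second
years = seconds / (365.25 * 24 * 3600)

print(f"combinations: about 10^{len(str(combinations)) - 1}")
print(f"bytes needed: about 10^{len(str(bytes_needed)) - 1}")
print(f"years to iterate: {years:.3e}")
```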
Right, so since it is hopefully clear by now that this should not be attempted, I'll take a risk and attempt to solve your actual problem.
You've mentioned in the comments that you need this array for property testing - I'll take a massive leap of faith here and assume you want to test that every possible combination satisfies some conditions - and this is the mistake here, as the number of possible combinations is just... large...
Instead, you can test that some of the combinations work. You can easily generate a short, randomized list of combinations using:
Array.new(num) { DATASET.map(&:sample) }
Where num is the number of combinations you want to test. Note that there is a chance that some of the entries will be duplicated, but given your dataset size the chance is comparable to that of colliding UUIDs and can be safely ignored.
Generating such a subset of possible solutions is much easier, faster and, most importantly, possible. Since the output is randomized, it will test slightly different combinations on each run, so remember to set up seeding in your test suite if you want to be able to reproduce failures.
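To illustrate the point about reproducible failures, here is the same sampling idea sketched in Python rather than Ruby, with an explicit seed. The function name and the tiny dataset are invented for the example; log the seed your test run uses so that a failing combination can be regenerated.

```python
import random


def sample_combinations(dataset, num, seed=None):
    """Pick `num` random combinations, one element from each array in `dataset`."""
    rng = random.Random(seed)  # a dedicated generator keeps runs reproducible
    return [[rng.choice(values) for values in dataset] for _ in range(num)]


# Invented stand-in for the real 93-array dataset.
DATASET = [[1, 2, 3], ["a", "b"], [True, False], [0.5, 1.5, 2.5]]

first_run = sample_combinations(DATASET, num=5, seed=1234)
second_run = sample_combinations(DATASET, num=5, seed=1234)
```

With the same seed both runs produce the same list, so a failing combination from one run can be replayed in the next.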
I am trying to build a Sieve of Eratosthenes in Lua. I have tried several things, but I find myself confronted with the following problem:
The tables of Lua are too small for this scenario. If I just want to create a table with all numbers (see example below), the table is too "small" even with only 1/8 (...) of the number (the number is pretty big, I admit)...
max = 600851475143
numbers = {}
for i = 1, max do
  table.insert(numbers, i)
end
If I execute this script on my Windows machine, there is an error message saying: C:\Program Files (x86)\Lua\5.1\lua.exe: not enough memory. I tried it with Lua 5.3 on my Linux machine too, and the process was simply killed. So it is pretty obvious that Lua can't handle that number of entries.
I don't really know whether it is just impossible to store that number of entries in a Lua table or whether there is a simple solution for this (I tried using a long string as well...). And what exactly is the largest number of entries in a Lua table?
Update: And would it be possible to somehow allocate more memory for the table manually?
Update 2 (Solution for second question): The second question is an easy one. I just tested it by inserting numbers until the program breaks: 33,554,432 (2^25) entries fit in one one-dimensional table on my system with 12 GB of RAM. Why 2^25? My first guess was that 64 bits per number * 2^25 = 2,147,483,648 bits equals 2 GB, but 2^31 bits is only 256 MiB of raw number data. The limit therefore presumably comes from per-entry overhead and table growth running into the 2 GB address space available to the 32-bit Lua for Windows build.
P.S. You may have noticed that this number is from Project Euler Problem 3. Yes, I am trying to solve that. Please don't give specific hints (..). Thank you :)
The Sieve of Eratosthenes only requires one bit per number, representing whether the number has been marked non-prime or not.
One way to reduce memory usage would be to use bitwise math to represent multiple bits in each table entry. Lua 5.3 and later have native bitwise operators (&, |, ~, <<, >>); Lua 5.2 provides the bit32 library and LuaJIT the bit module. Depending on the underlying implementation, you should be able to represent 32 or 64 bits (number flags) per table entry.
Another option would be to use one or more very long strings instead of a table. You only need a linear array, which is really what a string is. Just have a long string with "t" or "f", or "0" or "1", at every position.
Caveat: String manipulation in Lua always involves duplication, which rapidly turns into n² or worse complexity in terms of performance. You wouldn't want one continuous string for the whole massive sequence, but you could probably break it up into blocks of a thousand, or of some power of 2. That would reduce your memory usage to 1 byte per number while minimizing the overhead.
Edit: After noticing a point made elsewhere, I realized your maximum number is so large that, even with a bit per number, your memory requirements would optimally be about 75 gigabytes (600851475143 / 8 bytes), which is extremely impractical. I would recommend following the advice Piglet gave in their answer, to look at Jon Sorenson's version of the sieve, which works on segments of the space instead of the whole thing.
I'll leave my suggestion, as it still might be useful for Sorenson's sieve, but yeah, you have a bigger problem than you realize.
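For what it's worth, here is a minimal sketch of the one-bit-per-number idea, written in Python instead of Lua to keep it short. The names are made up, and a bytearray with manual shifts and masks stands in for a Lua table holding packed integer flags.

```python
def bit_sieve(limit):
    """Return all primes <= limit, storing one flag bit per number."""
    if limit < 2:
        return []
    # Bit i of the array is set once i has been marked as composite.
    flags = bytearray((limit >> 3) + 1)

    def is_marked(i):
        return flags[i >> 3] & (1 << (i & 7))

    def mark(i):
        flags[i >> 3] |= 1 << (i & 7)

    i = 2
    while i * i <= limit:
        if not is_marked(i):
            # Start at i*i: smaller multiples were marked by smaller primes.
            for multiple in range(i * i, limit + 1, i):
                mark(multiple)
        i += 1
    return [n for n in range(2, limit + 1) if not is_marked(n)]


print(bit_sieve(30))
```

This needs limit / 8 bytes for the flags, which is fine for millions of numbers but still far too much for a limit in the hundreds of billions.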
Lua uses double precision floats to represent numbers (Lua 5.3 also has 64-bit integers). That's 64 bits per number either way.
600851475143 numbers result in about 4.8 terabytes (roughly 4.4 TiB) of memory for the values alone.
So it's not Lua's or its tables' fault. The error message even says
not enough memory
You just don't have enough RAM to allocate that much.
If you had read the linked Wikipedia article carefully, you would have found the following section:
As Sorenson notes, the problem with the sieve of Eratosthenes is not
the number of operations it performs but rather its memory
requirements.[8] For large n, the range of primes may not fit in
memory; worse, even for moderate n, its cache use is highly
suboptimal. The algorithm walks through the entire array A, exhibiting
almost no locality of reference.
A solution to these problems is offered by segmented sieves, where
only portions of the range are sieved at a time.[9] These have been
known since the 1970s, and work as follows
...
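To make the idea of a segmented sieve concrete, here is a generic Python sketch (not Lua, and not tuned for any particular problem). It assumes the usual scheme: sieve the primes up to sqrt(limit) once, then cross off their multiples in one fixed-size window at a time. The function names and the default segment size are arbitrary choices for illustration.

```python
import math


def simple_sieve(limit):
    """Plain sieve of Eratosthenes; only used for primes up to sqrt(limit)."""
    if limit < 2:
        return []
    flags = bytearray([1]) * (limit + 1)
    flags[0] = flags[1] = 0
    for i in range(2, math.isqrt(limit) + 1):
        if flags[i]:
            for multiple in range(i * i, limit + 1, i):
                flags[multiple] = 0
    return [i for i, flag in enumerate(flags) if flag]


def segmented_primes(limit, segment_size=1 << 16):
    """Yield primes <= limit while keeping only one segment in memory."""
    if limit < 2:
        return
    base_primes = simple_sieve(math.isqrt(limit))
    low = 2
    while low <= limit:
        high = min(low + segment_size - 1, limit)
        segment = bytearray([1]) * (high - low + 1)
        for p in base_primes:
            if p * p > high:
                break
            # First multiple of p inside the window, but never below p*p.
            start = max(p * p, ((low + p - 1) // p) * p)
            for multiple in range(start, high + 1, p):
                segment[multiple - low] = 0
        for offset, flag in enumerate(segment):
            if flag:
                yield low + offset
        low = high + 1


print(list(segmented_primes(30, segment_size=7)))
```

Memory use is bounded by the segment size plus the list of base primes, at the cost of revisiting the base primes for every segment.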
We are taught that the abstraction of RAM is a long array of bytes, and that it takes the CPU the same amount of time to access any part of it. What is the device that has the ability to access any byte out of the 4 gigabytes (on my computer) in the same time? This does not seem like a trivial task to me.
I have asked colleagues and my professors, but nobody can pinpoint how this task can be achieved with simple logic gates, and if it isn't just a tricky combination of logic gates, then what is it?
My personal guess is that you could access any memory location in O(log(n)) time, where n is the size of the memory, because each gate would split the memory in two and pass your memory access request on to the next gate, which splits the memory in two again. But that requires A LOT of gates. I can't come up with any other educated guess, and I don't even know the name of the device that I should look up on Google.
Please help my anguished curiosity, and thanks in advance.
edit<
This is what I learned!
Quoting one of the answers: "the RAM can send the value from cell addressed X to some output pins". This is where everyone (again) skips the part that is not trivial to me. The way I see it, in order to build a circuit that uses 64 pins to decide which byte out of 2^64 to fetch, each pin needs to split the remaining range of memory in two. If the bit at index 0 is 0, then the address is in the range 0 to 2^64/2, otherwise it is in the range 2^64/2 to 2^64, and so on. So the number of gates (let's call them that) a memory fetch passes through is 64, a constant. However, the total number of gates needed is N, where N is the number of bytes of memory.
Just because there are 64 pins, it doesn't mean that you can decode them into a single fetch from a range of 2^64. Does 4 gigabytes of memory come with 4 billion gates in the memory controller???
Now this can be improved. As I furiously read more and more about how this memory is architected, I learned that if you arrange the memory as a matrix with sqrt(N) rows and sqrt(N) columns, the number of gates that a memory fetch needs to pass through is O(log(2*sqrt(N))) and the number of gates required is 2*sqrt(N), which is much better, and I think that it's probably a trade secret.
/edit<
What the heck, I might as well make this an answer.
Yes, in the physical world, memory access cannot be constant time.
But it cannot even be logarithmic time. The O(log n) circuit you have in mind ultimately involves some sort of binary (or whatever) tree, and you cannot make a binary tree with constant-length wires in a 3D universe.
Whatever the "bits per unit volume" capacity of your technology is, storing n bits requires a sphere with radius O(n^(1/3)). Since information can only travel at the speed of light, accessing a bit at the other end of the sphere requires time O(n^(1/3)).
But even this is wrong. If you want to talk about actual limitations of our universe, our physics friends say the absolute maximum number of bits you can store in any sphere is proportional to the sphere's surface area, not its volume. So the actual radius of a minimal sphere containing n bits of information is O(sqrt(n)).
As I mentioned in my comment, all of this is pretty much moot. The models of computation we generally use to analyze algorithms assume constant-access-time RAM, which is close enough to the truth in practice and allows a fair comparison of competing algorithms. (Although good engineers working on high-performance code are very concerned about locality and the memory hierarchy...)
Let's say your RAM has 2^64 cells (places where it is possible to store a single value, let's say 8 bits). Then it needs 64 pins to address every cell with a different number. When a binary number X 'appears' at the input pins of your RAM, the RAM can send the value from cell addressed X to some output pins, and your CPU can get the value from there. In hardware the addressing can be done quite easily, for example by using multiple NAND gates (such an 'addressing device' built from logic gates is called a decoder).
So it is all happening at the hardware level, this is just direct addressing. If the CPU is able to provide 64 bits to the 64 pins of your RAM, it can address every single memory cell (as 64 bits are enough to represent any number up to 2^64 - 1). The only reason why you do not get the value immediately is a kind of 'propagation time', that is, the time it takes for the signal to go through all the logic gates in the circuit.
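To make the decoder less abstract, here is a toy Python model of one. It is only a sketch of the logic, not of how a real chip is laid out: every cell gets one select line, and each select line is the AND of all address bits or their complements. The function names and the 3-bit example are made up.

```python
def decode(address, address_bits):
    """Return 2**address_bits select lines, exactly one of which is 1."""
    bits = [(address >> i) & 1 for i in range(address_bits)]
    lines = []
    for cell in range(2 ** address_bits):
        select = 1
        for i, bit in enumerate(bits):
            wanted = (cell >> i) & 1
            # AND together each address bit, inverted where the cell's
            # own number has a 0 in that position.
            select &= bit if wanted else 1 - bit
        lines.append(select)
    return lines


def read(memory, address, address_bits):
    """OR together all cells, each gated by its select line."""
    value = 0
    for select, cell_value in zip(decode(address, address_bits), memory):
        if select:
            value |= cell_value
    return value


MEMORY = [10, 20, 30, 40, 50, 60, 70, 80]  # 8 cells, so 3 address bits
print(decode(5, 3))
print(read(MEMORY, 5, 3))
```

In hardware all select lines are computed in parallel, so the delay depends on the depth of the gates rather than on the number of cells, while the gate count does grow with the number of cells.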
The component responsible for dealing with memory accesses is the memory controller. It is used by the CPU to read from and write to memory.
The access time is constant because memory words are truly laid out in matrix form (thus, the "byte array" abstraction is very realistic), with rows and columns. To fetch a given memory position, the desired memory address is passed on to the controller, which then activates the right row and column.
From http://computer.howstuffworks.com/ram1.htm:
Memory cells are etched onto a silicon wafer in an array of columns
(bitlines) and rows (wordlines). The intersection of a bitline and
wordline constitutes the address of the memory cell.
So, basically, the answer to your question is: the memory controller figures it out. Of course, given a memory address, the mapping to column and row must be easy to calculate in constant time.
To fully understand this topic, I recommend reading this guide on how memory works: http://computer.howstuffworks.com/ram.htm
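As a rough illustration of why that mapping is cheap, here is a Python sketch that splits an address into a row and a column with a shift and a mask. The 16/16 bit split and the function names are assumptions for the example; real DRAM adds banks, ranks and channels on top of this.

```python
def split_address(address, column_bits):
    """Split a flat address into (row, column) using a shift and a mask."""
    row = address >> column_bits
    column = address & ((1 << column_bits) - 1)
    return row, column


def select_line_counts(address_bits, column_bits):
    """Compare select lines for a row/column matrix and a flat decoder."""
    row_bits = address_bits - column_bits
    matrix_lines = (1 << row_bits) + (1 << column_bits)
    flat_lines = 1 << address_bits
    return matrix_lines, flat_lines


row, column = split_address(0x12345678, column_bits=16)
matrix_lines, flat_lines = select_line_counts(address_bits=32, column_bits=16)
print(hex(row), hex(column), matrix_lines, flat_lines)
```

With a square layout the number of select lines grows like 2*sqrt(N) instead of N, which is the improvement the question's edit describes.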
There are so many concepts to master that it is difficult to explain it all in one answer.
I read through your comments and questions before answering. I think you are on the right track, but there is some confusion here. The random access you are implying doesn't exist in quite the way you think it does.
Reading, writing, and refreshing are done in a continuous cycle. A particular cell in memory is only read or written in a certain interval if a signal is detected to do so in that cycle. There is going to be support circuitry that includes "sense amplifiers to amplify the signal or charge detected on a memory cell."
Unless I am misunderstanding what you are implying, your confusion is about how easy it is to read from or write to a cell. It differs depending on chip design, but there IS a minimum number of cycles it takes to read or write data to a cell.
These are my sources:
http://www.doc.ic.ac.uk/~dfg/hardware/HardwareLecture16.pdf
http://www.electronics.dit.ie/staff/tscarff/memory/dram_cycles.htm
http://www.ece.cmu.edu/~ece548/localcpy/dramop.pdf
To avoid a humungous answer, I left most of the detail out but all three of these will describe the process you are looking for.
This question was asked at one of the big software companies. I have come up with a simple solution and I want to know what others think about it.
You are supposed to design an API and a backend for a system that can
allot phone numbers to people living in a city. The phone numbers will
start from 111-111-1111 and end at 999-999-9999. The API should enable
the clients (people in the city) to do the following:
When a client requests a phone number, it allots one of the available numbers to them.
Some clients may want fancy numbers, so they can specifically ask for a number to be allotted to them. If the requested number is
available then the system allots it to them, otherwise the system
allots any available number.
The system need not know which number is allotted to which
client. The same client may make successive requests and get multiple
phone numbers for himself, but the system is not bothered. At any
point of time, the system only knows which phone numbers are allotted
and which phone numbers are free.
The numbers from 111-111-1111 to 999-999-9999 correspond to almost 9 billion numbers (8,888,888,889 if every value in the range is valid). Assuming that memory is not a constraint, I can think of the following two approaches (which are quite similar).
Maintain a huge boolean array with one entry per number in the range and a next pointer that points to an array index (next is initialized to zero). If the entry pointed to by next is not free, then advance next until a free number is found. When a fancy number is requested, just check whether the corresponding index position is free and return the number. The downside of this approach is that, when allocating numbers in the regular way, if there is a huge chunk (say 1 billion numbers) in the middle that was taken by fancy allocation, then the next pointer has to be moved 1 billion times.
To overcome the difficulty mentioned in the previous design, we can use some sort of linked hashmap. We maintain a doubly linked list (this replaces the array in the previous design) and an array of the same length as the list, where each element of the array points to the corresponding element in the list. When allocating numbers in the regular way, we advance a pointer in the linked list and mark nodes as we allocate them (same as the previous method). When allocating fancy numbers, we can directly find the node in the list that corresponds to the requested number by first indexing into the array and then following the pointer. Once the node is identified, we link its previous node to its next node so that a regular allocation does not have to skip the used numbers one by one (which was the problem with the previous approach).
Let me know whether I am on the right track. Please enlighten me about any important details that I am missing.
You can do significantly better in the answer to this question.
First you should design your API. The one recommended by Icarus3 is perfectly good:
string acquireNextAvailableNumber();
boolean acquireRequestedNumber(string special);
The second one returns true (and reserves the number) if it is available, otherwise returns false.
The question doesn't specify how you allocate phone numbers, so allocate them to suit yourself. Make the first 'next available' request return "111-111-1111", the next "111-111-1112" etc. This means you can record all the numbers allocated through 'next' by just remembering the last one allocated. (You'll need to ask whether "111-111-1119" is followed by "111-111-1120" or "111-111-1121", but that's the sort of thing you should be asking anyway. In any case, the important thing is that you can work out the next number from the last allocated one.)
Special requests you will need to store individually. A hash table would work, but so would a BST or simply an ordered list. It depends on what tradeoffs you want between space and speed, and how often special numbers are likely to be requested. I'll use a BST (ordered by the number) in the rest of this, for reasons I'll come to.
So, how do you code this? For the next allocated number:
Look at the last allocated number, and find the next in sequence.
Check that number hasn't been allocated as a special number. You can do this very quickly with a BST because if it's there, it will be the lowest entry in the BST.
If the number was in the 'special numbers' database, increment the 'allocated numbers' value (to include that number) and remove the entry from the special numbers. Then repeat this process until you get a number that isn't in the special numbers.
Note that this process ensures that all 'special numbers' lower than the last one allocated by 'next' do not appear in the special numbers database. As the 'last normal number allocated' increases, it absorbs any special numbers allocated that were less than that, removing them from the table. This is what ensures that when we ask whether the next number in sequence is in the special numbers database, we only have to look at the lowest entry.
Checking for a special number is easy. If it is lower than the last 'normal' number allocated it isn't available. Otherwise you check to see if it exists in the BST. If it doesn't, you add it to the BST.
You can optimize this process by storing not just single numbers in the BST, but ranges of numbers. If the allocated special numbers are dense, this reduces both the size of the tree and the number of accesses needed to find out whether a number is in there. If the test for the 'next' number finds a range of size n, you can immediately increase the highest normal number by n, instead of having to go round the loop n times.
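The scheme described above (a counter for sequential numbers plus an ordered store of special numbers) can be sketched in Python roughly as follows. Python has no built-in BST, so this swaps in a min-heap plus a set, which gives the same "look only at the lowest special number" behaviour. The class name, method names and integer representation are assumptions for illustration, and the range optimization is left out.

```python
import heapq

FIRST_NUMBER = 1111111111
LAST_NUMBER = 9999999999


def format_number(n):
    """Render a 10-digit integer as NNN-NNN-NNNN."""
    return f"{n // 10**7:03d}-{n // 10**4 % 1000:03d}-{n % 10**4:04d}"


class PhoneAllocator:
    def __init__(self, first=FIRST_NUMBER, last=LAST_NUMBER):
        self._last_sequential = first - 1  # everything <= this is taken
        self._max = last
        self._special_heap = []            # specials above the counter
        self._special_set = set()          # same specials, for lookups

    def acquire_next(self):
        candidate = self._last_sequential + 1
        # Absorb specials that the counter is about to run into.
        while self._special_heap and self._special_heap[0] == candidate:
            heapq.heappop(self._special_heap)
            self._special_set.discard(candidate)
            self._last_sequential = candidate
            candidate += 1
        if candidate > self._max:
            raise RuntimeError("no numbers left")
        self._last_sequential = candidate
        return candidate

    def acquire_requested(self, number):
        if not (self._last_sequential < number <= self._max):
            return False                   # out of range or already absorbed
        if number in self._special_set:
            return False                   # already taken as a special
        self._special_set.add(number)
        heapq.heappush(self._special_heap, number)
        return True


allocator = PhoneAllocator()
allocator.acquire_requested(1111111112)
print(format_number(allocator.acquire_next()))
print(format_number(allocator.acquire_next()))
```

The second sequential request skips 111-111-1112 because it was reserved as a special number and absorbed by the counter.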
First, you did not prototype your APIs. For example, if I had to design these APIs, I would publish 2 of them.
string acquireNextAvailableNumber();
string acquireRequestedNumber(string special);
Second, you need to decide how you are going to implement it: code driven or data driven?
You can maintain a hash of all these numbers (it will consume memory) and quickly query the availability of a number. Or
you could maintain a single list that stores only the distributed numbers (less memory). Then, whenever a request comes in, you search that list for the numbers 1 to n in turn (increased time complexity). If the first such number (or the requested one) isn't in the list, you allocate it to the client and add that entry to the list.
As there are billions of numbers, you will need to consider the trade-off between space and time.
You could also take advantage of a database.
To enhance the previous answers: a plain BST may not be good enough, as insertions or deletions can make it unbalanced. A balanced BST, e.g. a Red-Black Tree, should be a good choice.
So, a Red-Black Tree can be created and filled at the beginning to represent the available numbers, and each allocation should remove an element from it.
init(from, to) - can be done in O(n) time, a straightforward implementation would be O(n log n). But that is a one-time initialization on your server's start
acquireNextAvailableNumber() - should remove the smallest element, time cost O(log n)
acquireRequestedNumber(special) - should make a search and remove element if found, guaranteed time cost O(log n)
In Java, a TreeSet<String> or TreeSet<Long> could be used, since TreeSet is implemented with a Red-Black Tree (Integer is too small, because 9,999,999,999 exceeds Integer.MAX_VALUE).
The next question would probably be about several request-processing threads accessing your API. Since Java's TreeSet is not thread-safe, you should wrap it at initialization like so:
TreeSet<Long> numbers = init(...);
SortedSet<Long> availableNumbers = Collections.synchronizedSortedSet(numbers);
Our teacher has asked us around 50 true or false questions in preparation for our final exam. I could find an answer for most of them online or by asking relatives. However, these 4 questions are driving me crazy. Most of these questions aren't that hard, I just can't get a satisfying answer anywhere. Sorry, the original questions are not written in English, I had to translate them myself. If you don't understand something, please tell me.
Thanks!
True or false
The size of the manipulated address by the processor determines the size of the virtual memory. However, the size of the memory cache is independent.
For a long time, DRAM technology stayed incompatible with the CMOS technology used to build the standard logic in processors. This is the reason DRAM memory is (most of the time) used outside of the processor (on a different chip).
Pagination lets multiple virtual addressing spaces correspond to the same physical addressing space.
An associative cache memory with sets of 1 line is an entierly associative cache memory, because one memory block can go in any set since each sets are of the same size that of the block.
"Manipulated address" is not a term of the art. You have an m-bit virtual address mapping to an n-bit physical address. Yes, a cache may be of any size up to the physical address size, but typically is much smaller. Note that cache lines are tagged with virtual or more typically physical address bits corresponding to the maximum virtual or physical address range of the machine.
Yes, DRAM processes and logic processes are each tuned for different objectives, and involve different process steps (different materials and thicknesses to lay down DRAM capacitor stacks/trenches, for example), and historically you haven't built processors in DRAM processes (except the Mitsubishi M32RD) nor DRAM in logic processes. The exception is so-called eDRAM, which IBM likes to use in their SOI processes, and which is used as last level cache in IBM microprocessors such as the POWER7.
"Pagination" is what we call issuing a form feed so that text output begins at the top of the next page. "Paging" on the other hand is sometimes a synonym for virtual memory management, by which a virtual address is mapped (on a page by page basis) to a physical address. If you set up your page tables just so it allows multiple virtual addresses (indeed, virtual addresses from different processes' virtual address spaces) to map to the same physical address and hence the same location in real RAM.
"An associative cache memory with sets of 1 line is an entierly associative cache memory, because one memory block can go in any set since each sets are of the same size that of the block."
Hmm, that's a strange question. Let's break it down. 1) You can have a direct mapped cache, in which an address maps to only one cache line. 2) You can have a fully associative cache, in which an address can map to any cache line; there is something like a CAM (content addressable memory) tag structure to find which line, if any, matches the address. Or 3) you can have an n-way set associative cache, in which you have, essentially, n sets of direct mapped caches, and a given address can map to one of n lines. There are other more esoteric cache organizations, but I doubt you're being taught them.
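So let's parse the statement. "An associative cache memory". Well, that rules out direct mapped caches. So we're left with "fully associative" and "n-way set associative". It has sets of 1 line. OK, so if it is set associative, then instead of something traditional like 4 ways x 64 lines/way, it is n ways x 1 line/way. In other words, it is fully associative. I would say this is a true statement, except the term of the art is "fully associative", not "entirely associative". Note that this reading treats each "set" in the statement as a way. Under the more common definition, where a set is the group of lines a given block may occupy, sets of 1 line would make the cache direct mapped and the statement false, so check which meaning your course material uses.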
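If it helps, here is a small Python sketch of how the three organizations relate. It uses the common textbook convention that a set is the group of lines a block may occupy and that the number of sets equals total lines divided by ways. The function names and the 256-line cache are invented for the example.

```python
def number_of_sets(total_lines, ways):
    """Sets = lines / ways; each set holds `ways` candidate lines."""
    if ways < 1 or total_lines % ways:
        raise ValueError("ways must divide the number of lines")
    return total_lines // ways


def classify(total_lines, ways):
    """Name the cache organization for a given associativity."""
    sets = number_of_sets(total_lines, ways)
    if sets == 1:
        return "fully associative"   # a block may go in any line
    if ways == 1:
        return "direct mapped"       # a block has exactly one possible line
    return f"{ways}-way set associative"


def set_index(block_address, total_lines, ways):
    """The only set in which a given memory block may be placed."""
    return block_address % number_of_sets(total_lines, ways)


for ways in (1, 4, 256):
    print(ways, classify(256, ways), set_index(1000, 256, ways))
```

With 1 line per set (ways = 1) every block has a single possible location, while with a single set containing all lines (ways = 256) a block can go anywhere.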
Makes sense?
Happy hacking!
True, more or less (it depends on the accuracy of your translation I guess :) ) The number of bits in addresses sets an upper limit on the virtual memory space; you could, of course, choose not to use all the bits. The size of the memory cache depends on how much actual memory is installed, which is independent; but of course if you had more memory than you can address, then it still can't be used.
Almost certainly false. We have RAM on separate chips so that we can install more without building a whole new computer or replacing the CPU.
There is no a-priori upper or lower limit to the cache size, though in a real application certain sizes make more sense than others, of course.
I don't know of any incompatibility. The reason why we use SRAM as on-die cache is because it's faster.
Maybe you can force an MMU to map different virtual addresses to the same physical location, but usually it's used the other way around.
I don't understand the question.