What is the size of the largest message that real-logic's Aeron can process?
I realize Aeron shines with small messages, but I would like to use a single protocol throughout our stack, and some of our messages easily reach 100 MB.
The documentation is not clear on which settings affect the answer to this question. My worry is that the default buffer settings don't allow messages of this size. Or do the buffer settings have no impact on the maximum application message size?
We discussed this issue at a recent conference with Martin Thompson, the main contributor to Aeron.
Not only is it impossible to exceed the size of a page; if you take a look at the Aeron presentation, slides 52 to 54, these are the main reasons not to push large files:
page faults are almost guaranteed if the message size is close to the page size, and that makes Aeron repeat the sequence with the next page, i.e. a major slowdown
cache churn
VM pressure
Long story short, split your 100 MB message into smaller chunks: at most 1/4 of the page size minus the header if you use IPC, or the MTU (maximum transmission unit) size minus the header if you go via UDP. That will improve throughput, prevent pipe clogging, and fix other issues. The logic behind the MTU limit is that UDP packets occasionally get lost, and Aeron re-sends everything starting from the lost packet, so for performance you want each chunk to fit in a single packet.
For maximum performance, in your application-level protocol I would send a first packet carrying the file metadata and length; the receiving side then creates a memory-mapped file of that length on disk and fills it with the incoming chunks, where each chunk includes an offset and a file ID (in case you want to push multiple files simultaneously). This way you copy straight from the Aeron buffer into the mmap and completely avoid garbage collection. I expect such a file transfer to be faster than most other options.
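For illustration, a minimal Java sketch of such a chunking sender over an Aeron Publication. The 16-byte chunk header (fileId, offset), the CHUNK_SIZE value, and the ChunkSender name are assumptions made up for this example, not anything Aeron prescribes:

import io.aeron.Publication;
import org.agrona.concurrent.UnsafeBuffer;

import java.nio.ByteBuffer;

public final class ChunkSender
{
    private static final int HEADER = 16;       // 8-byte fileId + 8-byte offset
    private static final int CHUNK_SIZE = 1408; // sized to fit a typical 1500-byte MTU with room for headers

    public static void sendFile(final Publication pub, final long fileId, final byte[] file)
    {
        final UnsafeBuffer buffer = new UnsafeBuffer(ByteBuffer.allocateDirect(HEADER + CHUNK_SIZE));

        for (int offset = 0; offset < file.length; offset += CHUNK_SIZE)
        {
            final int length = Math.min(CHUNK_SIZE, file.length - offset);
            buffer.putLong(0, fileId);                      // which file this chunk belongs to
            buffer.putLong(8, offset);                      // where the receiver writes it in the mmap
            buffer.putBytes(HEADER, file, offset, length);

            // offer() is non-blocking and returns a negative value under back-pressure
            while (pub.offer(buffer, 0, HEADER + length) < 0L)
            {
                Thread.yield(); // retry until the term buffer has space
            }
        }
    }
}

The receiving side would map a file of the announced length (e.g. via FileChannel.map) and copy each chunk's payload to its offset, so no intermediate garbage is created.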
To answer the original question:
With the default settings, the maximum message size is 16MB minus a 32-byte header.
To change it, on your media driver you can try changing the aeron.term.buffer.length and aeron.term.buffer.max.length system properties (see Aeron Configuration Options). Here, the term buffer corresponds to a "page" in other parts of the documentation and in fact works similarly to OS memory pages in some respects, while term.buffer.max.length configures the number of buffers/pages in rotation.
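For example, a minimal sketch of raising those properties on an embedded media driver; the values here are just arbitrary powers of two, not recommendations:

import io.aeron.driver.MediaDriver;

public final class BigTermDriver
{
    public static void main(final String[] args)
    {
        // Term buffer lengths must be powers of two; set them before the driver starts.
        System.setProperty("aeron.term.buffer.length", String.valueOf(64 * 1024 * 1024));
        System.setProperty("aeron.term.buffer.max.length", String.valueOf(128 * 1024 * 1024));

        try (MediaDriver driver = MediaDriver.launch())
        {
            System.out.println("Media driver running at " + driver.aeronDirectoryName());
            // ... keep the process alive while publishers and subscribers attach ...
        }
    }
}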
Related
I am frankly stumped. This is beyond my experience.
I have a C# MVC program that generates a zip file in a MemoryStream for downloading. The action method is invoked from JavaScript via a button click.
The only problem is that in some cases the potential file size can easily exceed one Gig and from my reading, that is a common problem. I've tried upping the Maximum Allowed Content Length to 3000000000 in Request Filtering on IIS (IIS8). I've tried adding requestLimits maxAllowedContentLength to my web.config. I've even tried breaking up the zip through multiple calls to the action method (without success), although I have yet to get any confirmation/denial that this is even possible.
Is there any setting within IIS or my web.config that I could be overlooking? Could this be a company network issue, not solvable on an app developer's level?
Okay, so it's kind of hard to explain big concepts in 400 characters or less, so I think I'm just causing more confusion sticking in the comments section. Besides, I think we're close enough here to an "answer" as you're likely to get.
The default constructor of MemoryStream essentially sets the initial size to 0. In reality, the initial size is set to somewhere around 256, but since the initial size is mostly a guide, and it doesn't actually claim that space until it's needed, it effectively starts at 0.
Each time you write to the stream, it checks how much is being written versus the remaining size of the buffer array. If it can't fit the write, it creates a new, larger buffer array and copies the old buffer array into that. In this way, setting an initial size can help somewhat, in that you start off with a larger initial buffer array and you may not need to grow that buffer. You might have a better chance of getting a contiguous block of memory, which I'll explain the importance of in a bit, but that actually kind of works against you, as well. If you only need 1MB for the file, but you're initializing with 100MB and there's not 100MB of contiguous memory, you'll get an OutOfMemoryException, even though there might be 1MB of contiguous memory available.
Regardless of whether you initialize or not, certain immutable facts remain. First, MemoryStream requires contiguous memory. Even if you technically have memory available on the system, it's possible you might not have large blocks of available memory. In other words, if you have 4GB available but it's all fragmented, even trying to create a 1GB stream in memory could fail, simply because it can't reserve 1GB of contiguous memory. Obviously, the larger the file you're trying to create in memory, the greater the chance that you're going to run into this issue. For this reason alone, I would say you're out of luck without raising the amount of system RAM. With 8GB, and probably only 4-6GB actually available to IIS and then split up between worker processes and threads, the odds that you can claim 25% or so of the available RAM as contiguous space are slim.
The next immutable fact may or may not be relevant, but since you haven't specified, I'll mention it. If your web app is deployed as 32-bit, you'll have a hard limit of 2GB for any object, meaning a MemoryStream could never house more than 2GB (actually around 1.3-1.6GB as .NET code consumes some of that address space), and any attempt to make it do so will result in an OutOfMemoryException, even if you had some ridiculous amount of RAM on the system like 1TB+. If your app is 64-bit, this is less likely an issue as you can address a ton more memory, assuming it's compiled properly. You'd have to pretty much try to screw that up, though, so you should be fine.
Finally, multiple writes can cause an issue as well. As I said previously, the buffer array resizes (if necessary) in response to writes. Each time it resizes, the new buffer array must also be able to fit in contiguous address space. As a result, multiple resizes can cause you to bump into an OutOfMemoryException you wouldn't have hit if you had written all the data from the start. This is where initializing the MemoryStream can be helpful, but as I said before, it's also a double-edged sword, as your initial buffer size might be too great to begin with and you end up with an exception where you may have not had one letting it grow organically. Long and short, try to write everything to the stream in one go rather than piecemeal.
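The same grow-and-copy dynamic is easy to see in Java's ByteArrayOutputStream, which behaves much like MemoryStream here; a minimal sketch contrasting organic growth with a pre-sized buffer (the sizes are arbitrary):

import java.io.ByteArrayOutputStream;

public final class GrowthDemo
{
    public static void main(final String[] args)
    {
        final byte[] payload = new byte[1024 * 1024]; // 1MB per write

        // Organic growth: the internal array doubles as needed, and every doubling
        // allocates a fresh contiguous array and copies the old contents into it.
        final ByteArrayOutputStream growing = new ByteArrayOutputStream();
        for (int i = 0; i < 16; i++)
        {
            growing.write(payload, 0, payload.length);
        }

        // Pre-sized: one contiguous allocation up front, so no grow-and-copy cycles,
        // but it fails immediately if that much contiguous memory is not available.
        final ByteArrayOutputStream presized = new ByteArrayOutputStream(16 * 1024 * 1024);
        for (int i = 0; i < 16; i++)
        {
            presized.write(payload, 0, payload.length);
        }
    }
}

Either way, the single-allocation path mirrors the advice above: write everything in one go and you pay for at most one large contiguous block.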
x86 and x64 processors allow 1GB pages when the pdpe1gb flag is set on the CPU. In what application would this be practical or required, and for what reason?
Hugepages help in cases where you have a large memory footprint and a memory access pattern that spans large distances (across 4K pages).
They not only reduce TLB misses but also shrink the page tables the OS memory-management subsystem has to maintain.
A very good example is packet processing. In high-throughput network applications (1Gbps or more), packets are normally stored in a packet buffer pool (i.e. a pooling technique). For example, say every packet buffer is 2KB in size and the pool contains 512 buffers. The access pattern over this pool might not be sequential (buffers indexed 1, 2, 3, 4, 5...) but rather random over time (1, 104, 407, 45, 905...). Since the normal page size is 4K, a normal TLB won't help here: each packet access is likely to incur a TLB miss, because the buffers are spread across many different pages.
In contrast, if you put the pool in a 1GB hugepage, then all packet buffers share the same hugepage TLB entry, thus avoiding misses.
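The arithmetic behind that claim, as a tiny sketch; the 2KB x 512 pool comes from the example above, and the ~64-entry TLB size is an assumed typical figure:

public final class PoolPages
{
    public static void main(final String[] args)
    {
        final int bufferSize = 2 * 1024;                 // 2KB per packet buffer
        final int bufferCount = 512;
        final int poolBytes = bufferSize * bufferCount;  // 1MB pool

        final int pages4k = poolBytes / (4 * 1024);      // 256 distinct 4K pages
        System.out.println("Pool spans " + pages4k + " 4K pages, far more than a typical ~64-entry TLB covers");
        System.out.println("The whole pool fits inside one 1GB hugepage, i.e. a single TLB entry");
    }
}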
This is used in DPDK (the Data Plane Development Kit), where the packet rate is so high that the cycles wasted on TLB misses are not negligible.
Hugepage support is required for the large memory pool allocation used for packet buffers (the HUGETLBFS option must be enabled in the running kernel as indicated in the previous section). By using hugepage allocations, performance is increased since fewer pages are needed, and therefore fewer Translation Lookaside Buffer (TLB, high-speed translation cache) entries, which reduces the time it takes to translate a virtual page address to a physical page address. Without hugepages, high TLB miss rates would occur with the standard 4K page size, slowing performance.
http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html#bios-setting-prerequisite-on-x86
Another example from Oracle:
...almost 6.8 GB of memory used for page tables when hugepages were not configured...
...after hugepages were allocated and used by the Oracle database. The page table overhead was reduced to slightly less than 23 MB
http://www.databasejournal.com/features/oracle/understanding-hugepages-in-oracle-database.html
Related links:
https://en.wikipedia.org/wiki/Object_pool_pattern
--Edit--
However, hugepages should be used carefully. Above, I mentioned that a memory pool would benefit from a 1GB hugepage; however, if your access pattern spans even 1GB page boundaries, it might not help. There is an excellent blog post on this:
http://www.pvk.ca/Blog/2014/02/18/how-bad-can-1gb-pages-be/
Imagine an application that uses huge amounts of memory: molecular modeling, or weather prediction, especially if it has no user interaction.
Large pages:
(1) reduce the amount of memory consumed by page tables
(2) increase the amount of memory that can be covered by the MMU cache (the same number of cache entries references more memory)
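To put rough numbers on point (1), a back-of-the-envelope sketch, assuming 64GB of mapped memory and 8-byte page-table entries (typical on x86-64); the reduction is the same effect as the Oracle page-table figures quoted above:

public final class PageTableMath
{
    public static void main(final String[] args)
    {
        final long mappedBytes = 64L << 30; // 64GB of mapped memory
        final long pteSize = 8;             // bytes per page-table entry

        final long entries4k = mappedBytes / (4L << 10); // 16,777,216 entries
        final long entries1g = mappedBytes / (1L << 30); // 64 entries

        System.out.println("4K pages:  " + entries4k + " PTEs, ~" + ((entries4k * pteSize) >> 20) + " MB of page tables");
        System.out.println("1GB pages: " + entries1g + " PTEs, " + (entries1g * pteSize) + " bytes of page tables");
    }
}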
I have LabView installed on my Dell workstation with 8 cores and 16GB of DDR RAM, driving 4 24" monitors. If I create a video processor or compositor of most any type, with a 1024px x 1024px 'drawing' display, LabView reserves 1.5GB before I even begin to composite. It was built from C and C++. I often store image details in 3D arrays of 256 x 256 x 256 U32 integers that hold each RGB pixel color, plus the alpha channel for opacity or masking. That's 64MB per layer of buffered video. If I need to remember 128 layers, that's 8GB right there. LabView is a programming language structured much like a CAD program. If I need 8GB for a series of video (HDTV) buffers, that is what it will give me, with a few seconds' wait for malloc to do its work. If I created an 8GB 3D array for a database, it would be no different, even if I did it in MySQL (not as an array). To me, having many gigabytes of RAM to play with is the norm, not an exception.
"a smaller main memory per node may reduces the amount of computation a node can
execute without internode communication, increases the frequency of communication, and reduces the sizes of the respective messages communicated."
I don't understand the sentence above; could you help me by explaining it or restating it?
This clearly refers to distributed computing.
a smaller main memory per node may reduces the amount of computation a node can execute without internode communication,
The less memory a node has, the less data it holds, and generally the less work it has to do on that data. The more memory it has, the more work it can do before synchronizing with the other nodes to request more work or to send its result toward some greater common deduction.
increases the frequency of communication,
When less memory is available, the node has to request smaller chunks of work (because it can't hold any more data) and thus makes more requests.
and reduces the sizes of the respective messages communicated
When less memory is available, the node's work chunks are smaller, and thus the messages carrying that work are smaller.
For example
Let's assume you want to do a distributed "grep" on a file with 1,000 lines, on a cluster with 2 machines.
First Scenario
If each machine has a capacity of 1 line, then each machine will have to exchange 500 messages (send and return).
Second Scenario
If each machine has a capacity of 500 lines, then each machine will receive one message of size 500 and return a single message with all the lines that contain the requested word.
The first scenario involves less computation per communication and a higher frequency of message passing, but smaller messages to send and receive.
In my application I need to read data from an input stream. I have set the current buffer size for reading to 1024. But I have seen that in some Android applications the buffer size is kept at 8192 (8 KB). Will there be any specific advantage if I increase the buffer size in my application to 8KB?
Any expert opinion will be much appreciated.
Edit: (I am using BB OS 6 and 7 and I am dealing with network inputstream.)
I can't say that I've found the universally best buffer size, but it seems to me that something in the range of 1KB to 8KB should be fine in most situations (for BlackBerry Java apps).
Keep in mind that if the amount of data is small (so you'd probably only need one or two buffers at 1KB-8KB), it's probably best just to use the IOUtilities method:
byte[] result = IOUtilities.streamToBytes(inputStream);
with which you don't need to actually pick a buffer size. But, if you know that result would be a large block of data, you're probably right in wanting to read one buffer at a time.
However, I would argue that the answer should almost always be obtained simply by building the app, and measuring performance with a few different values for byte buffer size. It's easy enough to change one constant, build, run and measure again, and then you're not guessing, or taking the advice of someone who doesn't know all the details of your app.
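For the buffered case, a minimal sketch of the read loop in question; BUFFER_SIZE is the constant you would vary between builds while measuring:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public final class StreamReader
{
    private static final int BUFFER_SIZE = 8 * 1024; // the value to experiment with

    public static byte[] readFully(final InputStream in) throws IOException
    {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final byte[] buffer = new byte[BUFFER_SIZE];

        int read;
        while ((read = in.read(buffer)) != -1) // read() may return fewer bytes than requested
        {
            out.write(buffer, 0, read);
        }
        return out.toByteArray();
    }
}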
See here for information about BlackBerry Eclipse plugin memory analysis, and
here for BlackBerry Eclipse plugin profiling.
These tools are found in Eclipse by selecting the Window menu, then Show View -> Other... -> BlackBerry -> BlackBerry Memory Statistics View, or BlackBerry Profiler View, while debugging.
This way, you can see how much memory, or processor, the network code is using during the call to retrieve data and populate your buffer.
More
BlackBerry InputStream to String conversion
This question was also asked in the official BlackBerry forum here:
http://supportforums.blackberry.com/t5/Java-Development/What-is-the-best-size-for-a-buffer-in-BlackBerry/td-p/2559417
The OP gave this clarification:
"I am reading from network. Once I establish socket connection with the server, the server will send me notifications one after the other. I need to read the notifications/data from the inputstream available in the socket connection. For this I have a background thread which checks anything is available in the inputstream and if something is available, it will read with the help of a buffer and then passes the read data to a StringBuffer."
Given this information, I have a different take, in that I think the BlackBerry network handling abstracts the Java application from the network buffer processing to the extent that the application buffer size will have little if any impact on the performance.
But be aware, this is only my opinion.
My response on that thread was as follows:
First thing to note is that the method "isAvailable()", in my experience, does not work correctly on OS 5.0 and earlier. It is fixed in OS 6 (at least from my testing).
Because isAvailable() was broken (and for other application reasons), what I have implemented for a socket connection is that each message is preceded by its length. So on the socket connection, I read the length of the next message, and then the actual data. This is done in one go - in other words, I read the entire message, regardless of size. I recommend you do the same. The message must exist in full somewhere, so it makes no difference whether it is in some memory managed by the socket connection or in some memory managed by you.
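As a minimal sketch of that length-prefixed framing (the 4-byte length prefix and the readMessage name are my assumptions about the wire format, not a BlackBerry API):

import java.io.DataInputStream;
import java.io.IOException;

public final class Framing
{
    // Each message on the wire: a 4-byte big-endian length, then the payload.
    public static byte[] readMessage(final DataInputStream in) throws IOException
    {
        final int length = in.readInt();  // length of the next message
        final byte[] message = new byte[length];
        in.readFully(message);            // returns only once the whole message has been read
        return message;
    }
}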
Note also that until OS 6.0, when you did a read you would get enough data to fill the buffer you supplied - in other words, it waited until the buffer was full. In OS 6.0 and later, the read can complete without giving you a full buffer.
In your case, you might be targeting post-OS 6.0 devices only, so you could use isAvailable() - create a buffer of that size, and read everything. I can't see that it makes any difference whether the bytes sit in memory managed by the socket or in memory managed by you.
But in fact, I would argue that the best approach is the one that makes your processing simplest. So for example, if you know that the next message is 200 bytes, then read 200 bytes, and then process that message. Then read the next message.
You could spend a lot of time attempting to manage the buffers to match the underlying socket buffers. I don't know exactly how the underlying BlackBerry socket processing code works, but it doesn't put data directly into your buffers. So let it manage its buffer size to optimize the network, you manage your buffer size to optimize your processing. That will work best for everyone.
Paging is explained here, slide #6:
http://www.cs.ucc.ie/~grigoras/CS2506/Lecture_6.pdf
in my lecture notes, but I cannot for the life of me understand it. I know it's a way of translating virtual addresses to physical addresses. So the virtual addresses, which are on disk, are divided into chunks of 2^k. I am really confused after this. Can someone please explain it to me in simple terms?
Paging is, as you've noted, a type of virtual memory. To answer the question raised by @John Curtsy: it's covered separately from virtual memory in general because there are other types of virtual memory, although paging is now (by far) the most common.
Paged virtual memory is pretty simple: you split all of your physical memory up into blocks, mostly of equal size (though having a selection of two or three sizes is fairly common in practice). Making the blocks equal sized makes them interchangeable.
Then you have addressing. You start by breaking each address up into two pieces. One is an offset within a page. You normally use the least significant bits for that part. If you use (say) 4K pages, you need 12 bits for the offset. With (say) a 32-bit address space, that leaves 20 more bits.
From there, things are really a lot simpler than they initially seem. You basically build a small "descriptor" to describe each page of memory. This will have a linear address (the address used by the client application to address that memory), and a physical address for the memory, as well as a Present bit. There will (at least usually) be a few other things like permissions to indicate whether data in that page can be read, written, executed, etc.
Then, when client code uses an address, the CPU starts by splitting the page offset off from the rest of the address. It then takes the rest of the linear address and looks through the page descriptors to find the physical address that goes with that linear address. Then, to address the physical memory, it combines the upper 20 bits of the physical address with the lower 12 bits of the linear address, and together they form the actual physical address that goes out on the processor pins and gets data from the memory chip.
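A toy sketch of that split-and-lookup, assuming 4K pages and a 32-bit address space; the flat int[] page table stands in for the real descriptor structures, and the Present bit and permissions are ignored:

public final class PageTranslation
{
    private static final int OFFSET_BITS = 12;                     // 4K pages
    private static final int OFFSET_MASK = (1 << OFFSET_BITS) - 1; // 0xFFF

    // pageTable[pageNumber] holds the physical page base, upper 20 bits already in place.
    public static int translate(final int linearAddress, final int[] pageTable)
    {
        final int pageNumber = linearAddress >>> OFFSET_BITS; // upper 20 bits select a descriptor
        final int offset = linearAddress & OFFSET_MASK;       // lower 12 bits pass through untouched

        return pageTable[pageNumber] | offset;                // the address that goes out on the pins
    }
}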
Now, we get to the part where we get "true" virtual memory. When programs are using more memory than is actually available, the OS takes the data for some of those descriptors, and writes it out to the disk drive. It then clears the "Present" bit for that page of memory. The physical page of memory is now free for some other purpose.
When the client program tries to refer to that memory, the CPU checks that the Present bit is set. If it's not, the CPU raises an exception. When that happens, the CPU frees up a block of physical memory as above, reads the data for the current page back in from disk, and fills in the page descriptor with the address of the physical page where it's now located. When it's done all that, it returns from the exception, and the CPU restarts execution of the instruction that caused the exception to start with -- except now, the Present bit is set, so using the memory will work.
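A hypothetical sketch of that fault path; Descriptor, evictVictim() and readPageFromDisk() are invented stand-ins for real OS machinery, not any actual kernel API:

public final class PageFaultSketch
{
    static final class Descriptor
    {
        boolean present;       // cleared when the page has been swapped out
        int physicalPageBase;  // valid only while present
        long diskLocation;     // where the page contents live on disk
    }

    static int access(final Descriptor d, final int offset)
    {
        if (!d.present)                      // the CPU would raise a page fault here
        {
            final int frame = evictVictim(); // free a physical frame, writing it out if dirty
            readPageFromDisk(d.diskLocation, frame);
            d.physicalPageBase = frame;
            d.present = true;                // the retried instruction now succeeds
        }
        return d.physicalPageBase | offset;  // normal translation resumes
    }

    private static int evictVictim() { return 0x5000; }                        // placeholder policy
    private static void readPageFromDisk(final long from, final int frame) {}  // copy 4K from disk
}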
There is one more detail that you probably need to know: the page descriptors are normally arranged into page tables, and (the important part) you normally have a separate set of page tables for each process in the system (and another for the OS kernel itself). Having separate page tables for each process means that each process can use the same set of linear addresses, but those get mapped to different sets of physical addresses as needed. You can also map the same physical memory to more than one process by just creating two separate page descriptors (one for each process) that contain the same physical address. Most OSes use this so that, for example, if you have two or three copies of the same program running, there's really only one copy of the executable code for that program in memory -- but two or three sets of page descriptors that point to that same code, so all of them can use it without making separate copies for each.
Of course, I'm simplifying a lot -- quite a few complete (and often fairly large) books have been written about virtual memory. There's also a fair amount of variation among machines, with various embellishments added, minor changes in parameters made (e.g., whether a page is 4K or 8K), and so on. Nonetheless, this is at least a general idea of the core of what happens (and it's still at a high enough level to apply about equally to an ARM, x86, MIPS, SPARC, etc.)
Simply put, it's a way of holding far more data than your physical memory would normally allow: pages that aren't currently in use can live on disk, so the addresses a program uses aren't limited by the amount of RAM actually installed.
Paging is a storage mechanism that allows the OS to retrieve processes from secondary storage into main memory in the form of pages. In the paging method, main memory is divided into small fixed-size blocks of physical memory called frames. The size of a frame is kept the same as that of a page to achieve maximum utilization of main memory and to avoid external fragmentation.