Apache Thrift maximum message size

We are using Apache Thrift to exchange messages between two systems. In one of the messages we exchange a C++ list which can become huge. Can you please let me know what the maximum message size is that we can exchange using Apache Thrift?

There is no defined limit per se (at least none that I am aware of). It mostly depends on how the data are held in memory, what load is on the server and how many resources are available. For the most part, contiguous blocks of memory (RAM) will very likely become the scarcest resource, so we should focus on that point.
The "how the data are held in memory" refers to the fact that, for the sake of better throughput, some transports (buffered, framed) tend to allocate more memory and larger blocks than others. Depending on the language's implementation, this process may be implemented more or less efficiently in terms of memory cost.
If you really plan to transfer large blocks of data, you should also look at other options, such as
chunking the data into blocks (see the sketch below)
sending/returning only a URL or LAN share through the service, instead of the whole data
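To make the chunking option concrete, here is a minimal C++ sketch. The sendChunk callback is a stand-in for a method on your generated Thrift client; the names here are made up for illustration and are not part of Thrift itself:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <vector>

    // Split a big payload into bounded-size pieces so that each Thrift call
    // carries only one chunk instead of the whole list.
    void sendInChunks(const std::vector<int32_t>& data, std::size_t chunkSize,
                      const std::function<void(std::size_t, const std::vector<int32_t>&)>& sendChunk) {
        for (std::size_t offset = 0; offset < data.size(); offset += chunkSize) {
            const std::size_t end = std::min(offset + chunkSize, data.size());
            std::vector<int32_t> chunk(data.begin() + offset, data.begin() + end);
            sendChunk(offset, chunk);  // one bounded-size message per call
        }
    }

    int main() {
        std::vector<int32_t> huge(1000000, 42);
        sendInChunks(huge, 65536, [](std::size_t offset, const std::vector<int32_t>& chunk) {
            // In a real client this lambda would call the generated Thrift stub.
            std::cout << "would send " << chunk.size() << " items at offset " << offset << "\n";
        });
    }

The server side would reassemble chunks keyed by offset (or stream them straight to their destination), and each individual message stays small enough that no transport buffer has to hold the whole list at once.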

Related

CPUs in multi-core architectures and memory access

I wondered how memory access is handled "in general" if, for example, two CPU cores try to access memory at the same time (over the memory controller). The same applies when a core and a DMA-enabled IO device try to access memory in the same way.
I think the memory controller is smart enough to utilise the address bus and handle those requests concurrently, but I'm not sure what happens when they try to access the same location, or when the IO operation monopolises the address bus and there's no room for the CPU to move on.
Thx
The short answer is "it's complex, but access can certainly potentially occur in parallel in certain situations".
I think your question is a bit too black and white: you may be looking for an answer like "yes, multiple devices can access memory at the same time" or "no they can't", but the reality is that first you'd need to describe some specific hardware configuration, including some of the low-level implementation details and optimization features to get an exact answer. Finally you'd need to define exactly what you mean by "the same time".
In general, a good first-order approximation is that hardware will make it appear that all hardware can access memory approximately simultaneously, possibly with an increase in latency and a decrease in bandwidth due to contention. At a very fine-grained timing level, access by one device may indeed postpone access by another device, or it may not, depending on many factors. It is extremely unlikely you would need this information to implement software correctly, and quite unlikely you need to know the details even to maximize performance.
That said, if you really need to know the details, read on and I can give some general observations on some kind of idealized laptop/desktop/server scale hardware.
As Matthias mentioned, you first have to consider caching. Caching means that any read or write operation subject to caching (which includes nearly all CPU requests and many other types of requests as well) may not touch memory at all, so in that sense many cores can "access" memory (at least the cache image of it) simultaneously.
If you then consider requests that miss in all cache levels, you need to know about the configuration of the memory subsystem. In general, a RAM chip can only do "one thing" at a time (i.e., commands1 such as read and write apply to the entire module), and that usually extends to DRAM modules comprised of several chips and also to a series of DRAMs connected via a bus to a single memory controller.
So you can say that, electrically speaking, the combination of one memory controller and its attached RAM is likely to be doing only one thing at once. Now that thing is usually something like reading bytes out of a physically contiguous span of bytes, but that operation could actually help handle several requests from different devices at once: even though each device sends separate requests to the controller, good implementations will coalesce requests to the same or nearby2 areas of memory.
Furthermore, even the CPU may have such abilities: when a new request occurs it can/must notice that an existing request is in progress for an overlapping region and tie the new request to an old one.
Still, you can say that for a single memory controller you'll usually be serving the request of one device at a time, absent unusual opportunities to combine requests. Now the requests themselves are typically on the order of nanoseconds, so many separate requests can be served in a small unit of time, and this "exclusiveness" is therefore fine-grained and not generally noticeable3.
Now above I was careful to limit the discussion to a single memory controller - when you have multiple memory controllers4 you can definitely have multiple devices accessing memory simultaneously, even at the RAM level. Here each controller is essentially independent, so if the requests from two devices map to different controllers (different NUMA regions) they can proceed in parallel.
That's the long answer.
1 In fact, the command stream is lower level and more complex than things like "read" or "write" and involves concepts such as opening a memory page, streaming bytes from it, etc. What Every Programmer Should Know About Memory serves as an excellent intro to the topic.
2 For example, imagine two requests for adjacent bytes in memory: it is possible the controller can combine them into a single request if they fit within the bus width.
3 Of course if you are competing for memory across several devices, the overall impact may be very noticeable: a reduction in per-device bandwidth and an increase in latency, but what I mean is that the sharing is fine-grained enough that you can't generally tell the difference between finely-sliced exclusive access and some hypothetical device which makes simultaneous progress on each request in each period.
4 The most common configuration on modern hardware is one memory controller per socket, so on a 2P system you'd usually have two controllers, although other ratios (both higher and lower) are certainly possible.
There are dozens of things that come into play. E.g. on the lowest level there are bus arbitration mechanisms which allow multiple participants to access a shared address and data bus.
On a higher level, there are also things like CPU caches that need to be considered: if a CPU reads from memory, it might only read from its local cache, which might not reflect the state that exists in another CPU core's local cache. To synchronize memory between cache instances in multicore systems, there exist cache coherence protocols which are implemented in the CPUs. These have to guarantee that if one CPU writes to shared memory, the caches of all other CPUs (which might also contain a copy of the memory location's content) get updated.

Kernel memory management: where do I begin?

I'm a bit of a noob when it comes to kernel programming and was wondering if anyone could point me in the right direction for beginning the implementation of memory management in a kernel setting. I am currently working on a toy kernel and am doing a lot of research on the subject, but I'm a bit confused on the topic of memory management. There are so many different aspects to it, like paging and virtual memory mapping. Is there a specific order in which I should implement things, or any dos and don'ts? I'm not looking for any code or anything, I just need to be pointed in the right direction. Any help would be appreciated.
There are multiple aspects that you should consider separately:
Managing the available physical memory.
Managing the memory required by the kernel and its data structures.
Managing the virtual memory (space) of every process.
Managing the memory required by any process, i.e. malloc and free.
To be able to manage any of the other memory demands, you first need to know how much physical memory you actually have and which parts of it are available for your use.
Assuming your kernel is loaded by a Multiboot-compatible boot loader, you'll find this information in the multiboot information structure that the boot loader passes you (its address arrives in ebx on x86; eax holds the Multiboot magic value).
The structure contains a memory map describing which memory areas are used and which are free to use; a sketch of walking it follows.
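As a rough sketch (struct layout per the Multiboot 1 spec, trimmed to only the fields used here; in a real kernel you'd use the multiboot.h header that ships with the spec):

    #include <cstdint>

    // Simplified layout following the Multiboot 1 spec.
    struct multiboot_info {
        uint32_t flags;
        uint32_t mem_lower, mem_upper;   // valid if flags bit 0 is set
        uint32_t boot_device;
        uint32_t cmdline;
        uint32_t mods_count, mods_addr;
        uint32_t syms[4];
        uint32_t mmap_length;            // valid if flags bit 6 is set
        uint32_t mmap_addr;
    };

    struct multiboot_mmap_entry {
        uint32_t size;      // size of this entry, NOT counting this field itself
        uint64_t addr;      // start of the region
        uint64_t len;       // length in bytes
        uint32_t type;      // 1 = available RAM, anything else = reserved
    } __attribute__((packed));           // gcc/clang attribute

    void walk_memory_map(const multiboot_info* mbi) {
        if (!(mbi->flags & (1u << 6))) return;   // no memory map provided
        uintptr_t cur = mbi->mmap_addr;
        const uintptr_t end = mbi->mmap_addr + mbi->mmap_length;
        while (cur < end) {
            auto* e = reinterpret_cast<const multiboot_mmap_entry*>(cur);
            if (e->type == 1) {
                // hand [e->addr, e->addr + e->len) to the physical memory manager
            }
            cur += e->size + sizeof(e->size);    // entries are variable-sized
        }
    }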
You also need to store this information somehow, and keep track of what memory is allocated and freed. An easy method to do so is to maintain a bitmap, where bit N indicates whether the (fixed size S) memory area from N * S to (N + 1) * S - 1 is used or free. Of course you will probably want more sophisticated methods like multilevel bitmaps or free lists as your kernel advances, but a simple bitmap as above can get you started (see the sketch below).
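A minimal sketch of such a bitmap allocator over 4 KB frames might look like this (no locking and no reserved-range initialization, both of which you'd need in practice):

    #include <cstddef>
    #include <cstdint>

    // Bit N set = frame N in use. Covers 4 GB of physical memory with 4 KB
    // frames, costing 128 KB of bookkeeping.
    constexpr std::size_t kFrameSize = 4096;
    constexpr std::size_t kMaxFrames = 1 << 20;
    static uint32_t bitmap[kMaxFrames / 32];

    void mark_used(std::size_t frame) { bitmap[frame / 32] |=  (1u << (frame % 32)); }
    void mark_free(std::size_t frame) { bitmap[frame / 32] &= ~(1u << (frame % 32)); }
    bool is_used(std::size_t frame)   { return bitmap[frame / 32] & (1u << (frame % 32)); }

    // First-fit allocation: returns a frame number, or SIZE_MAX if out of memory.
    // The physical address of frame F is simply F * kFrameSize.
    std::size_t alloc_frame() {
        for (std::size_t f = 0; f < kMaxFrames; ++f)
            if (!is_used(f)) { mark_used(f); return f; }
        return SIZE_MAX;
    }

    void free_frame(std::size_t frame) { mark_free(frame); }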
This memory manager usually only provides "large" memory chunks, usually multiples of 4KB. This is of course of no use for dynamic memory allocation in the style of malloc and free that you're used to from application programming.
Since dynamic memory allocation will greatly ease implementing advanced features of your kernel (multitasking, inter-process communication, ...), you usually write a memory manager especially for the kernel. It provides means for allocation (kalloc) and deallocation (kfree) of arbitrarily sized memory chunks. This memory comes from pool(s) that are allocated using the physical memory manager from above; a minimal sketch follows.
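For illustration, a toy kalloc/kfree with a single size class could look like the following. It builds on alloc_frame from the bitmap sketch above; frame_to_virt is a hypothetical helper standing in for whatever physical-to-virtual mapping your kernel uses:

    #include <cstddef>
    #include <cstdint>

    // Toy allocator: a singly linked free list of fixed-size blocks carved
    // out of pages from the physical memory manager. Real kernels use size
    // classes, slab or buddy allocators; this is only a sketch.
    struct FreeBlock { FreeBlock* next; };

    constexpr std::size_t kBlockSize = 64;          // one size class only
    static FreeBlock* free_list = nullptr;

    extern std::size_t alloc_frame();               // physical manager, above
    extern void* frame_to_virt(std::size_t frame);  // hypothetical mapping helper

    void* kalloc() {
        if (!free_list) {                           // refill from a fresh 4 KB page
            auto* page = static_cast<uint8_t*>(frame_to_virt(alloc_frame()));
            for (std::size_t off = 0; off < 4096; off += kBlockSize) {
                auto* b = reinterpret_cast<FreeBlock*>(page + off);
                b->next = free_list;
                free_list = b;
            }
        }
        FreeBlock* b = free_list;
        free_list = b->next;
        return b;
    }

    void kfree(void* p) {                           // caller must pass kalloc'd blocks
        auto* b = static_cast<FreeBlock*>(p);
        b->next = free_list;
        free_list = b;
    }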
All of the above is happening inside the kernel. You probably also want to provide applications with a means to do dynamic memory allocation. Implementing this is very similar in concept to the management of physical memory as done above:
A process only sees its own virtual address space. Some parts of it are unusable for the process (for example the area where the kernel memory is mapped into), but most of it will be "free to use" (that is, no actual physical memory is associated with it yet). As a minimum, the kernel needs to provide applications a means to allocate and free single pages of their memory address space. Allocating a page results (under the hood, invisible to the application) in a call to the physical memory manager and in a mapping from the requested page to this newly allocated memory.
Note though that many kernels provide their processes either more sophisticated access to their own address space, or directly implement some of the following tasks in the kernel.
Being able to allocate and free pages (mostly 4KB) as before doesn't by itself give you dynamic memory management; as before, this is usually handled by another memory manager which uses these large memory chunks as a pool from which to provide smaller chunks to the application. A prominent example is Doug Lea's allocator. Memory managers like these are usually implemented as a library (most likely part of the standard library) that is linked to every application.

How do memory access patterns affect memory bandwidth

I am involved in a computation that allocates about 1TByte of data in main memory.
I need to understand how much memory access patterns can affect the bandwidth to the processor.
For example I read that an Intel Xeon processor can support a memory bandwidth of 65GBytes/sec. Does this mean that this bandwidth will be achieved under all memory access patterns or only under optimal access patterns?
I know that on each request for data to main memory a whole cache line (64 Bytes) is pulled in. I consider the memory access pattern optimal, if on each request the entire cache line is used before the next request is issued. The worst case would be if on each request only one double (8 Bytes) is used before the next request is issued.
Suppose I know that on each request, on average, out of the 8 doubles that come in I will be using only f = 1, 2, ..., 8 of them before the next request is issued. Is there an easy way to compute the actual bandwidth I will get as a function of f?
Thanks in advance for all replies
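Taking the model in the question at face value, a first-order estimate is straightforward: every miss transfers a full 64-byte line, but only f of its 8 doubles are used, so the useful fraction is f/8. Writing B_peak for the peak figure (65 GB/s in the example):

    useful_bandwidth(f) ≈ B_peak * (8 bytes * f) / 64 bytes = B_peak * f / 8

    e.g. f = 2:  65 GB/s * 2/8 ≈ 16 GB/s of useful data

This is only an upper-bound sketch: it ignores prefetching, TLB misses, and latency-bound access (too few outstanding misses to keep the bus busy), any of which can push the achieved figure lower; conversely, even f = 8 rarely reaches the advertised peak in practice.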

Data Storage with AWE Memory via collections / lists / other containers

Does anyone have any suggestions (product, toolsets, methods or other) for the storage and processing of custom data (Delphi collections, binary trees, DIContainers etc.) that DOES NOT restrict itself to a standard Win32 memory address space? To put that in the extreme: is there anything off the shelf that can do the equivalent of holding a 10GB TList, thereby blowing past the /3GB switch barrier and the 4GB 'windows on windows' limit?
What we ideally need is something that is pretty transparent to the Delphi application programmer, but allows very fast access to the data held in its structures, preferably via key lookup. The equivalent of a Delphi collection container would be fine, but its memory usage needs to be via AWE. It would also need to take care of mapping and unmapping the physical space it uses into the Win32 process making use of it, i.e. that would be the transparent bit...
Moving the data into a database is not the answer - the information needs to remain memory-resident for very fast access. The in-memory databases/tables that we've tried do not make use of AWE and are also slow at accessing. Our current Delphi data structures are fine, but are straining the limits of the Win32 address space.
I'm going to be a complete dork, and tell you that I've made something even more advanced than what you're describing.... at work. So it's all closed source I'm afraid. Never saw anything like this anywhere. We combine VM, AWE, MMF and (soon) 32<>64 bit IPC into one big, mean data-processing machine, addressing up to 64 GB of memory, while processing hundreds of datasets, tens of GBs each.
But I can give you a few tips: AWE view-swapping is rather slow, because it forcibly pauses all running threads during the swap. Therefore, choose your window sizes wisely (the smaller, the faster the swap - but call overhead is lower with larger sizes, of course). We've settled on AWE view sizes equal to the Windows default page size (4 KB), but only because random access performs best that way. Linear data access could run faster with bigger view sizes.
Each view can map to any part of the allocated AWE memory, so one thing that can help is mapping only those pages into a view that need to be accessed - and try to save on unnecessary view-swaps (a priority queue comes to mind).
Also, there should be a registration mechanism somewhere in your design that handles the linkage between a view and the AWE memory behind it. And this had better be thread-safe!
As for general usage: no, this doesn't fit in with regular Delphi classes. You should switch over to another concept altogether - and base your data structures on that.
Anyway, good luck mate! You're going to need it... ;-)
There are system calls that can do this, but support is uneven across Windows versions - in particular, 32-bit client editions cap usable physical RAM at 4GB, which limits what AWE can buy you there.
Transparency would be something of an issue as the API could not return pointers to objects. Mapping more than 4GB of RAM into a 4GB address space means that a 32 bit pointer could be ambiguous - you could potentially map different objects into the same location.
This ambiguity means that you would have to generate proxies for the objects which hold a handle that could be used to access the 'record'. Some SQL server versions use this technique to store disk buffers in AWE memory. An approach like this would probably work for something like rows in a matrix where the operations are done on the whole row. Finer grained access would be more fiddly.
In order to provide direct access to the mapped object you would have to implement a protocol where a temporary pointer to the mapped memory was made available. This would also require the object to be locked in memory while in use - again, bang goes your transparency.
Assuming you can get a 64-bit version of Delphi now, you might be better off going to a 64-bit version of Windows for customers that need more RAM.
You state that you do not want to move to a database, but what about a database that specifically uses AWE?
I've not tried it personally, but would consider using products from this company for my own projects.
[Edit]: NexusDB is Delphi-friendly: it originated from the old TurboPower FlashFiler development (but has moved on a long way since then).
The issue with AWE is that it works very much like the old, DOS-based EMS and XMS - if you ever used them. Basically, a range of addressable memory is reserved, and the memory outside the addressable range is then mapped into the addressable range when needed, and unmapped when no longer needed, allowing other memory to be mapped at the same addresses.
Thus most non-AWE-aware data structures or containers wouldn't work in such a scenario - a TMemoryStream descendant is probably easier to build. It should be easy enough to build a TList or the like that stores data in AWE memory; it would have to keep track of where the data are really stored and recall them when needed, adjusting addresses as data are mapped into addressable memory. A sketch of the underlying Win32 calls follows.
I am not aware of any Delphi container library using AWE, and there is another issue: desktop 32-bit operating systems can't use more than 4GB of physical RAM, so a server version would be required, and the supported physical RAM depends on which version is used; see here for a complete list.
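For reference, the underlying Win32 AWE calls look roughly like this (a minimal sketch; it requires the "Lock pages in memory" privilege to be enabled for the account, and error handling is mostly omitted):

    #include <windows.h>
    #include <iostream>

    int main() {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        const SIZE_T pageSize = si.dwPageSize;           // typically 4 KB

        // 1. Allocate physical pages (may return fewer than requested).
        ULONG_PTR numPages = 1024;                       // 4 MB of physical pages
        ULONG_PTR* pfns = new ULONG_PTR[numPages];
        if (!AllocateUserPhysicalPages(GetCurrentProcess(), &numPages, pfns)) {
            std::cerr << "AllocateUserPhysicalPages failed: " << GetLastError() << "\n";
            return 1;                                    // likely missing the privilege
        }

        // 2. Reserve a small addressable window; the physical pages behind it
        //    can be swapped at will with MapUserPhysicalPages.
        void* window = VirtualAlloc(nullptr, pageSize,
                                    MEM_RESERVE | MEM_PHYSICAL, PAGE_READWRITE);

        // 3. Map physical page 0 into the window, use it, then remap to page 1.
        MapUserPhysicalPages(window, 1, &pfns[0]);
        static_cast<char*>(window)[0] = 'A';
        MapUserPhysicalPages(window, 1, &pfns[1]);       // same address, new page
        static_cast<char*>(window)[0] = 'B';

        // 4. Unmap and release.
        MapUserPhysicalPages(window, 1, nullptr);        // NULL array = unmap
        FreeUserPhysicalPages(GetCurrentProcess(), &numPages, pfns);
        delete[] pfns;
    }

An AWE-aware container is essentially a bookkeeping layer over step 3: it records which logical element lives in which physical page, and remaps the window before handing out a (temporary) address.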
Assuming the data is loaded once in bulk and fits in available memory, NexusDB with AWE will be very, very fast. The database can be created as an in-memory-only DB and will then not need any further hard drive access while manipulating it.
Sounds to me like you guys might consider dropping the current SQL backend and going to a 100% NexusDB + AWE solution.
(Or rather, dropping the day-to-day access to the SQL backend, and having an export/sync function that can write out any required NexusDB reporting data to an MSSQL reporting DB.)
Your situation sounds similar to ours: our application uses a huge data file that we store in a memory-mapped file. The files are around 750MB, and we allocate data structures from them that use up to 1.5GB of RAM.
We have found no solution to the 4GB limit other than moving some of it off to FPC/Lazarus until Delphi is 64-bit, unfortunately. AWE does not work with Vista Home versions, and we also couldn't get it to work with MMFs.
You could try memory-mapped files with a sliding window, meaning you dynamically create views of different chunks of the file depending on what part of it the application is using. Sounds like that won't work though because you need the entire file in memory at once.

How does shared memory vs message passing handle large data structures?

In looking at Go and Erlang's approach to concurrency, I noticed that they both rely on message passing.
This approach obviously alleviates the need for complex locks because there is no shared state.
However, consider the case of many clients wanting parallel read-only access to a single large data structure in memory -- like a suffix array.
My questions:
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?
One thing to realise is that the Erlang concurrency model does NOT really specify that the data in messages must be copied between processes; it states that sending messages is the only way to communicate and that there is no shared state. As all data is immutable, which is fundamental, an implementation may very well not copy the data but just send a reference to it, or may use a combination of both methods. As always, there is no best solution, and there are trade-offs to be made when choosing how to do it.
The BEAM uses copying, except for large binaries where it sends a reference.
Yes, shared state could be faster in this case. But only if you can forgo the locks, and that is only doable if the data is absolutely read-only. If it's 'mostly read-only' then you need a lock (unless you manage to write lock-free structures; be warned that they're even trickier than locks), and then you'd be hard-pressed to make it perform as fast as a good message-passing architecture.
Yes, you could write a 'server process' to share it. With really lightweight processes, it's no heavier than writing a small API to access the data. Think of it as an object (in the OOP sense) that 'owns' the data; a sketch of the idea is just below. Splitting the data into chunks to enhance parallelism (called 'sharding' in DB circles) helps in big cases (or if the data is on slow storage).
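Sketched outside Erlang, as a C++ analogy (not Erlang's actual mechanism): one thread owns the data, and clients only ever talk to it through a request queue:

    #include <condition_variable>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    // A "server process": one owner thread serves a read-only structure;
    // clients never touch it directly, they enqueue requests.
    int main() {
        const std::vector<std::string> data = {"alpha", "beta", "gamma"};

        std::queue<std::function<void()>> inbox;
        std::mutex m;
        std::condition_variable cv;
        bool done = false;

        std::thread server([&] {
            for (;;) {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return done || !inbox.empty(); });
                if (inbox.empty()) return;         // done and drained
                auto req = std::move(inbox.front());
                inbox.pop();
                lk.unlock();
                req();                             // serve the request
            }
        });

        auto lookup = [&](std::size_t i) {         // a "client" sends a message
            std::lock_guard<std::mutex> lk(m);
            inbox.push([&, i] { std::cout << data.at(i) << "\n"; });
            cv.notify_one();
        };
        lookup(0);
        lookup(2);

        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
        server.join();
    }

The queue is the only synchronization point; the data itself needs no lock because only the server thread reads it on behalf of clients.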
Even as NUMA is becoming mainstream, you still have more and more cores per NUMA cell. And a big difference is that a message can be passed between just two cores, while a lock has to be flushed from the cache on ALL cores, limiting it to the inter-cell bus latency (even slower than RAM access). If anything, shared-state/locks is becoming increasingly unfeasible.
In short: get used to message passing and server processes - it's all the rage.
Edit: revisiting this answer, I want to add a phrase found in Go's documentation:
share memory by communicating, don't communicate by sharing memory.
The idea is: when you have a block of memory shared between threads, the typical way to avoid concurrent access is to use a lock to arbitrate. The Go style is to pass a message with the reference; a thread only accesses the memory when receiving the message. It relies on some measure of programmer discipline, but results in very clean-looking code that can be easily proofread, so it's relatively easy to debug.
The advantage is that you don't have to copy big blocks of data on every message, and don't have to effectively flush caches as with some lock implementations. It's still somewhat early to say whether the style leads to higher-performance designs or not (especially since the current Go runtime is somewhat naive about thread scheduling).
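As a C++ analogy of that style (not Go, but the same idea): ownership of the block travels in the message, so nothing is copied and only one thread touches the memory at a time. A one-shot channel via std::promise/std::future keeps the sketch short:

    #include <future>
    #include <iostream>
    #include <memory>
    #include <thread>
    #include <utility>
    #include <vector>

    // "Share memory by communicating": hand a unique_ptr through a channel,
    // so exactly one thread can touch the buffer at any moment. Only the
    // pointer moves; the big block is never copied.
    int main() {
        auto big = std::make_unique<std::vector<int>>(10000000, 7);

        std::promise<std::unique_ptr<std::vector<int>>> channel;
        auto incoming = channel.get_future();

        std::thread consumer([&] {
            auto buf = incoming.get();          // receive ownership
            std::cout << "sum of first 3: "
                      << (*buf)[0] + (*buf)[1] + (*buf)[2] << "\n";
        });

        channel.set_value(std::move(big));      // send; 'big' is now null
        // From here on, only 'consumer' may touch the buffer.
        consumer.join();
    }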
In Erlang, all values are immutable - so there's no need to copy a message when it's sent between processes, as it cannot be modified anyway.
In Go, message passing is by convention - there's nothing to prevent you from sending someone a pointer over a channel and then modifying the data pointed to; it's only convention, so once again there's no need to copy the message.
Most modern processors use variants of the MESI protocol. Because of the shared state, passing read-only data between different threads is very cheap. Modified shared data is very expensive, though, because all other caches that store this cache line must invalidate it.
So if you have read-only data, it is very cheap to share it between threads instead of copying it with messages. If you have read-mostly data, it can be expensive to share between threads, partly because of the need to synchronize access, and partly because writes destroy the cache-friendly behavior of the shared data.
Immutable data structures can be beneficial here. Instead of changing the actual data structure, you simply make a new one that shares most of the old data, but with the things changed that you need changed. Sharing a single version of it is cheap, since all the data is immutable, and you can still update to a new version efficiently (see the sketch below).
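A minimal C++ sketch of the publish-a-new-version idea (for simplicity this copies the whole map on update; real persistent structures share most of the old version structurally):

    #include <iostream>
    #include <map>
    #include <memory>
    #include <mutex>
    #include <string>

    // Readers grab an immutable snapshot (just a pointer copy) and then read
    // lock-free; the writer builds a modified copy and publishes it. Old
    // snapshots stay valid for readers that still hold them.
    class VersionedMap {
        std::shared_ptr<const std::map<std::string, int>> current_ =
            std::make_shared<const std::map<std::string, int>>();
        std::mutex write_mutex_;
    public:
        std::shared_ptr<const std::map<std::string, int>> snapshot() {
            std::lock_guard<std::mutex> lk(write_mutex_);
            return current_;                        // cheap: copies a pointer
        }
        void set(const std::string& key, int value) {
            std::lock_guard<std::mutex> lk(write_mutex_);
            auto next = std::make_shared<std::map<std::string, int>>(*current_);
            (*next)[key] = value;                   // modify the new version
            current_ = std::move(next);             // publish atomically under the lock
        }
    };

    int main() {
        VersionedMap vm;
        vm.set("answer", 42);
        auto snap = vm.snapshot();                  // immutable view
        vm.set("answer", 43);                       // does not affect 'snap'
        std::cout << snap->at("answer") << "\n";    // prints 42
    }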
What is a large data structure?
One person's large is another person's small.
Last week I talked to two people. One person was making embedded devices and used the word "large" - I asked him what it meant - he said over 256 KBytes. Later in the same week a guy was talking about media distribution - he used the word "large" - I asked him what he meant - he thought for a bit and said "won't fit on one machine", say 20-100 TBytes.
In Erlang terms "large" could mean "won't fit into RAM" - so with 4 GBytes of RAM, data structures > 100 MBytes might be considered large - copying a 500 MBytes data structure might be a problem. Copying small data structures (say < 10 MBytes) is never a problem in Erlang.
Really large data structures (i.e. ones that won't fit on one machine) have to be copied and "striped" over several machines.
So I guess you have the following:
Small data structures are no problem - since they are small, processing is fast, copying is fast and so on (just because they are small).
Big data structures are a problem - because they don't fit on one machine, copying is essential.
Note that your questions are technically nonsensical, because message passing can use shared state, so I shall assume that you mean message passing with deep copying to avoid shared state (as Erlang currently does).
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
Using shared state will be a lot faster.
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Either approach can be used.
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?
Copying is cache unfriendly and, therefore, destroys scalability on multicores because it worsens contention for the shared resource that is main memory.
Ultimately, Erlang-style message passing is designed for concurrent programming whereas your questions about throughput performance are really aimed at parallel programming. These are two quite different subjects and the overlap between them is tiny in practice. Specifically, latency is typically just as important as throughput in the context of concurrent programming and Erlang-style message passing is a great way to achieve desirable latency profiles (i.e. consistently low latencies). The problem with shared memory then is not so much synchronization among readers and writers but low-latency memory management.
One solution that has not been presented here is master-slave replication. If you have a large data structure, you can replicate changes to it out to all slaves, which perform the update on their own copy.
This is especially interesting if you want to scale to several machines that don't even have the possibility of sharing memory without very artificial setups (mmap of a block device that reads/writes from a remote computer's memory?).
A variant of this is to have a transaction manager that you ask nicely to update the replicated data structure, and which makes sure it serves one and only one update request concurrently. This is more like the mnesia model for master-master replication of mnesia table data, which qualifies as a "large data structure".
The problem at the moment is indeed that the locking and cache-line coherency might be as expensive as copying a simpler data structure (e.g. a few hundred bytes).
Most of the time, a cleverly written new multi-threaded algorithm that tries to eliminate most of the locking will be faster - and a lot faster with modern lock-free data structures. Especially when you have well-designed cache systems like Sun's Niagara chip-level multi-threading.
If your system/problem is not easily broken down into a few simple data accesses, then you have a problem. And not all problems can be solved by message passing. This is why there are still some Itanium-based supercomputers sold: they have a terabyte of shared RAM and up to 128 CPUs working on the same shared memory. They are an order of magnitude more expensive than a mainstream x86 cluster with the same CPU power, but you don't need to break down your data.
Another reason not mentioned so far is that programs can become much easier to write and maintain when you use multi-threading. Message passing and the shared-nothing approach make them even more maintainable.
As an example, Erlang was never designed to make things faster, but instead to use a large number of threads to structure complex data and event flows.
I guess this was one of the main points in the design. In the web world of Google you usually don't care about performance - as long as it can run in parallel in the cloud. And with message passing, you ideally can just add more computers without changing the source code.
Usually message-passing languages (this is especially easy in Erlang, since it has immutable variables) optimise away the actual data copying between processes (for local processes only, of course: you'll want to plan your network distribution pattern wisely), so this isn't much of an issue.
The other concurrency paradigm is STM, software transactional memory. Clojure's refs are getting a lot of attention. Tim Bray has a good series exploring Erlang's and Clojure's concurrency mechanisms:
http://www.tbray.org/ongoing/When/200x/2009/09/27/Concur-dot-next
http://www.tbray.org/ongoing/When/200x/2009/12/01/Clojure-Theses
