Why do you use `stream` for GRPC/protobuf file transfers? - stream

I've seen a couple examples like this:
service Service{
rpc updload(stream Data) returns (google.protobuf.Empty) {};
rpc download(google.protobuf.Empty) returns (stream Data) {};
}
message Data { bytes bytes = 1; }
What is the purpose of using stream, does it make the transfer more efficient?
In theory - yes - I obviously wan't to stream my file transfers but that's what happens over a connection... So, what is the actual benefit to this keyword, does it enforce some form of special buffering to reduce some overhead? Either way, the data is being transmitted, in full!

It's more efficient because, within a single call, multiple messages may be sent.
This avoids, not only re-establishing another (hopefully TLS i.e. even more work) connection with the server but also avoids spinning up client and server "stubs"; both the client and server are ready for more messages.
It's somewhat similar to being connected on a telephone call with your friend who, before hanging up, says "Oh, another thing...". Instead of hanging up the call and then, 5 minutes later, calling you back, interrupting dinner and causing you to pause a movie.

The answer is very similar to the gRPC + Image Upload question, although from a different perspective.
Doing a large download (10+ MB) as a single response message puts strong limits on the size of that download, as the entire response message is sent and processed at once. For most use cases, it is much better to chunk a 100 MB file into 1-10 MB chunks than require all 100 MB to be in memory at once. That also allows the downloader to begin processing the file before the entire file is acquired which reduces processing latency.
Without streaming, chunking would require multiple RPCs, which are annoying to coordinate and have performance complications. Because there is latency to complete RPCs, for reasonable performance you either have to do many RPCs in parallel (but how many?) or have a large batch size (but how big?). Multiple RPCs can also hit colder application caches, as each RPC goes to a different backend.
Using streaming provides the same throughput as the non-chunking approach without as many headaches of normal chunking approaches. Since streaming is pipelined (server can start sending next chunk as soon as previous chunk is sent) there's no added per-chunk latency between the client and server. This makes it much easier to choose a chunk size, as there is a wide range of "reasonable" sizes that will behave similarly and the system will naturally react as network performance varies.
While sending a message on an existing stream has less overhead than creating a new RPC, for many users the difference is negligible and it is generally better to structure your RPCs in a way that is architecturally beneficial to your application and not just to eek out small optimizations in gRPC. The reason to use the stream in this case is to make your application perform better at lower complexity.

Related

AVX2 Streaming Stores Do Not Improve Performance

I have an AVX2 implementation of some workload.
I have determined that the vast majority of the execution time is occupied
by the memory loads and stores.
In an attempt to improve performance, I tried to change the conventional stores
to streaming (non-temporal) stores.
However, this change had little to no positive performance impact (I was expecting a sizeable performance increase).
What could be the reason for this?
The use of streaming stores can lead to a better performance under some circumstances:
The data "to be stored" is not read before writing: Streaming stores are write-through, which produces immediate bus traffic. The standard store uses a write back strategy which may delay the bus operation until a later time and avoids bus operations with multiple writes to the same cache line.
The time used for stores is smaller than the time used for calculation: A streaming store has to be finished before the next streaming store can be issued. Thus, ahving too liitle computation in between two streaming stores leads to some idle time for the processor in which no further computation can be executed. Where this problem may also be possible with standard stores, streaming stores even increase it.
The data "to be stored" is not needed shortly after being written: The streaming store surpasses caches while writing/storing. Thus, there is no copy of the data in the cache. When reading the data aftwerwards the data has to be loaded into the cache. Thus, you have no gain over a standard store. However, when using a standard store, the data is loaded into the cache, modified there, and maybe still there when a later access happens.
So you have to consider your code and problem, to these circumstances to know if streaming stores are worth a try. In an unfitting scenario your performance might even drop.
A blog entry with additional info and a benchmark can be found e.g. here.

CPUs in multi-core architectures and memory access

I wondered how memory access is handled "in general" if ,for example, 2 cores of CPU try to access memory at the same time (over the memory controller)? Actually the same applies when a core and an DMA-enabled IO device try to access in the same way.
I think, memory controller is smart enough to utilise the address bus and handle those requests concurrently, however I'm not sure what happens when they try to access to same location or when the IO operation monopolises the address bus and there's no room for CPU to move on.
Thx
The short answer is "it's complex, but access can certainly potentially occur in parallel in certain situations".
I think your question is a bit too black and white: you may be looking for an answer like "yes, multiple devices can access memory at the same time" or "no they can't", but the reality is that first you'd need to describe some specific hardware configuration, including some of the low-level implementation details and optimization features to get an exact answer. Finally you'd need to define exactly what you mean by "the same time".
In general, a good first-order approximation is that hardware will make it appear that all hardware can access memory approximately simultaneously, possibly with an increase in latency and a decrease in bandwidth due to contention. At the very fine-grained timing level access one device may indeed postpone access by another device, or it may not, depending on many factors. It is extremely unlikely you would need this information to implement software correctly, and quite unlikely you need to know the details even to maximize performance.
That said, if you really need to know the details, read on and I can give some general observations on some kind of idealized latpop/desktop/server scale hardware.
As Matthias mentioned, you first have to consider caching. Caching means that any read or write operation subject to caching (which includes nearly all CPU requests and many other types of requests as well) may not touch memory at all, so in that sense many cores can "access" memory (at least the cache image of it) simultaneous.
If you then consider requests that miss in all cache levels, you need to know about the configuration of the memory subsystem. In general a RAM chips can only do "one thing" at a time (i.e., commands1 such a read and write apply to the entire module) and that usually extends to DRAM modules comprised of several chips and also to a series of DRAMs connected via a bus to a single memory controller.
So you can say that electrically speaking, the combination of one memory controller and its attached RAM is likely to be doing only on thing at once. Now that thing is usually something like reading bytes out of a physically contiguous span of bytes, but that operation could actually help handle several requests from different devices at once: even though each devices sends separate requests to the controller, good implementations will coalesce requests to the same or nearby2 area of memory.
Furthermore, even the CPU may have such abilities: when a new request occurs it can/must notice that an existing request is in progress for an overlapping region and tie the new request to an old one.
Still, you can say that for a single memory controller you'll usually be serving the request of one device at a time, absent unusual opportunities to combine requests. Now the requests themselves are typically on the order of nanoseconds, so many separate requests can be served in a small unit of time, so this "exclusiveness" fine-grained and not generally noticeable3.
Now above I was careful to limit the discussion to a single memory-controller - when you have multiple memory controllers4 you can definitely have multiple devices accessing memory simultaneously even at the RAM level. Here each controller is essentially independent, so if the requests from two devices map to different controllers (different NUMA regions) they can proceed in parallel.
That's the long answer.
1 In fact, the command stream is lower level and more complex than things like "read" or "write" and involves concepts such as opening a memory page, streaming bytes from it, etc. What every programmer should know about memory serves as an excellent intro to the topic.
2 For example, imagine two requests for adjacent bytes in memory: it is possible the controller can combine them into a single request if they fit within the bus width.
3 Of course if you are competing for memory across several devices, the overall impact may be very noticeable: a reduction in per-device bandwidth and an increase in latency, but what I mean is that the sharing is fine-grained enough that you can't generally tell the difference between finely-sliced exclusive access and some hypothetical device which makes simultaneous progress on each request in each period.
4 The most common configuration on modern hardware is one memory controller per socket, so on a 2P system you'd usually have two controllers, also other rations (both higher and lower) are certainly possible.
There are dozens of things that come into play. E.g. on the lowest level there are bus arbitration mechanisms which allow that multiple participants can access a shared address and data bus.
On a higher level there are also things like CPU caches that need to be considered: If a CPU reads from memory it might only read from it's local cache, which might not reflect that state that exists in another CPU cores local cache. To synchronize memory between cache instances in multicore systems there exist cache coherence protocols which are are implemented in the CPUs. These have to guarantee that if one CPU writes to shared memory the caches of all other CPUs (which might also contain a copy of the memory locations content) get updated.

Data Retrieval Throughput - ETS lookup vs inter-process Messaging

suppose we have an erlang application which involves thousands of processes. Suppose there is a single resource X which may be a tuple, a list, or any erlang term, which all these processes may need to read / pick out something from it, at any moment in time.
An example of such an occurrence, is say, an API system, in which client processes may need to read and write on a remote machine. Ant it happens that you do not want, for each read/write request, a new connection to be created. So, what you do, you create a pool of connections, consider them as a pool of open pipes/sockets/channels.
Now, this pool of resources is to be shared by thousands of processes such that for each read or write demand, you want that process to retrieve any available open channel/resource.
Question is, what if i have a process (a single process) hold this information, whether in its process dictionary or in its receive loop. It would mean that all the processes would have to send a message to this process whenever they need a free resource. This single process would have a huge mailbox at any time because of the high demand for this single resource. OR I could use an ETS Table, and have only one row, say, #resources{key=pool,value= List_of_openSockets_or_channels}. But this would mean that, all our processes would attempt to make a read from the ETS Table for the same row at (high probability) same instantaneous times.
How would the ETS Table handle, if 10,000 process atttempt a read, for the same row/record from it, at the same time/at almost same time ? and yet, if i use a process, its mailbox, if 10,000 processes send a message to it, at same time, for the same resource (and it would need to reply each requestor). And remember this action may occur so frequently. What option (dis-regarding availability issues of process going down blah blah), would provide higher throughput, in a way that, processes would get what they need faster ? Is there any other better way, of handling high demand data structures in the Erlang VM in a way that will provide very fast access to millions of processes, even if they all needed that resource at the same time ?
Short answer: profile. Try different approaches and verify how your system behaves.
Firstly, I would look at ETS' {read_concurrency, true} option. From the documentation:
{read_concurrency,boolean()} Performance tuning. Default is false.
When set to true, the table is optimized for concurrent read
operations. When this option is enabled on a runtime system with SMP
support, read operations become much cheaper; especially on systems
with multiple physical processors. However, switching between read and
write operations becomes more expensive. You typically want to enable
this option when concurrent read operations are much more frequent
than write operations, or when concurrent reads and writes comes in
large read and write bursts (i.e., lots of reads not interrupted by
writes, and lots of writes not interrupted by reads). You typically do
not want to enable this option when the common access pattern is a few
read operations interleaved with a few write operations repeatedly. In
this case you will get a performance degradation by enabling this
option. The read_concurrency option can be combined with the
write_concurrency option. You typically want to combine these when
large concurrent read bursts and large concurrent write bursts are
common.
Secondly, I would look at caching possibilities. Are the processes reading that information only once or multiple times? If they're accessing it multiple times, you could read it once and store it in your process state.
Thirdly, you could try to replicate and distribute that piece of information across your system. Divide et impera.
If you use the process approach, in order to avoid having all the read requests serialized on the message queue of the 'server' process you must replicate.
Using an ETS table with read_concurrency feels more natural and it is something that I used when developing the parallel version of Dialyzer. However, ETS access was never a bottleneck in that case.

Multiple hits to an API bringing server to it's knees

I am using an API (Let's pretend its facebook) to gather data between two given dates. Because of API restrictions (like most) I can only grab so many at a time, and therefor have to page my way through the results.
Here is my issue/question though.. Is it better to
get fewer results back, and make more calls to the api
get more results back, and fewer calls to the api
I am running a 4GB instance of a cloud server..
The data I'm looking at is in XML format, and contains about 20k entries. Each entry contains probably another 20 tags within it. Once completely pulled down the data ends up being about 10MB.. my problem is that when my server is hitting the api, gathering this information the CPU and Memory spike to nearly 100%. I've tried retrieving 500 at a time, 1000 at a time, 5000 at a time.. is this something where I need to gather 20 at a time.. or is there something else I should look at?
I'm not sure what else to provide, if there is something I can provide just let me know
Updates based on answers
I host with Storm on Demand, which runs perfectly for us and seems to be great hardware - https://www.stormondemand.com/cloud-server/
I use HPricot to parse the XML (which could probably be optimized, I'm no expert here)
I do need all of the data, this service doesn't offer an export, only API.
EDIT [to help people stumbling on this later]
I switched from Hpricot to Nokogiri, MUCH faster.
Also, I was building an XML file in memory, apparently that is extremely intense, and was a very time consuming task. I've cut this operation down from about 10 minutes, to just over 1 minute by fixing these two things.
Here's a list of things to look at:
optimize your code. try profiling your code and see if you can improve it. Mast likely using a better parser (DOM vs SAX) is possible.
get a better hardware/hosting. 4GB is just memory. Most likely you are on a shared hosting/vm and CPU limited
offload some CPU/memory heavy operations to a faster service/application, like XML processing, data analysis, file io can be done in C/C++
in a proper cloud environment you should be able to spawn more VMs and adjust your jobs/load accordingly. That will cost more tough and require some kind of job manager.
The questions you need to ask is why is your CPU+ memory spiking? 4GB is plenty to be handling this data, so is your code optimized to handle this task? If not, what can you do?
Is your code optimized enough? Fair enough. You can now rewrite them using C extensions.
After optimizing your code, I'd suggest checking out processing this data 'later', as in a delayed job. This way you aren't blocking on the entire dataset which may strain your server.
You also mentioned you are running a cloud server, which I can assume you have access to more Virtual Machines. You can process this data in pararel to reduce stress per machine.

OutOfMemoryException Processing Large File

We are loading a large flat file into BizTalk Server 2006 (Original release, not R2) - about 125 MB. We run a map against it and then take each row and make a call out to a stored procedure.
We receive the OutOfMemoryException during orchestration processing, the Windows Service restarts, uses full 2 GB memory, and crashes again.
The server is 32-bit and set to use the /3GB switch.
Also I've separated the flow into 3 hosts - one for receive, the other for orchestration, and the third for sends.
Anyone have any suggestions for getting this file to process wihout error?
Thanks,
Krip
If this is a flat file being sent through a map you are converting it to XML right? The increase in size could be huge. XML can easily add a factor of 5-10 times over a flat file. Especially if you use descriptive or long xml tag names (which normally you would).
Something simple you could try is to rename the xml nodes to shorter names, depending on the number of records (sounds like a lot) it might actually have a pretty significant impact on your memory footprint.
Perhaps a more enterprise approach, would be to subdivide this in a custom pipeline into separate message packets that can be fed through the system in more manageable chunks (similar to what Chris suggests). Then the system throttling and memory metrics could take over. Without knowing more about your data it would be hard to say how to best do this, but with a 125 MB file I am guessing that you probably have a ton of repeating rows that do not need to be processed sequentially.
Where does it crash? Does it make it past the Transform shape? Another suggestion to try is to run the transform in the Receive Port. For more efficient processing, you could even debatch the message and have multiple simultaneous orchestration instances be calling the stored procs. This would definately reduce the memory profile and increase performance.

Resources