AVX2 Streaming Stores Do Not Improve Performance

I have an AVX2 implementation of some workload.
I have determined that the vast majority of the execution time is occupied
by the memory loads and stores.
In an attempt to improve performance, I tried to change the conventional stores
to streaming (non-temporal) stores.
However, this change had little to no positive performance impact (I was expecting a sizeable performance increase).
What could be the reason for this?

The use of streaming stores can lead to better performance under some circumstances:
The data "to be stored" is not read before writing: streaming stores bypass the cache and write directly to memory, avoiding the read-for-ownership a standard store would need to bring the line into the cache first, but producing immediate bus traffic. The standard store uses a write-back strategy, which may delay the bus operation until later and avoids repeated bus operations when there are multiple writes to the same cache line.
The time used for stores is smaller than the time used for calculation: a streaming store has to be finished before the next streaming store can be issued. Thus, having too little computation between two streaming stores leads to idle time for the processor in which no further computation can be executed. While this problem is also possible with standard stores, streaming stores make it worse.
The data "to be stored" is not needed shortly after being written: the streaming store bypasses the caches while writing, so there is no copy of the data in the cache. When the data is read afterwards it has to be loaded into the cache again, and you have no gain over a standard store. With a standard store, by contrast, the data is loaded into the cache, modified there, and may still be there when a later access happens.
So you have to weigh your code and your problem against these circumstances to know whether streaming stores are worth a try. In an unfitting scenario your performance might even drop.
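As a minimal sketch of what the change looks like in isolation (a plain copy kernel, assuming 32-byte-aligned buffers and a length that is a multiple of 8 ints; not your actual workload), the only difference is the store intrinsic plus a fence:

#include <immintrin.h>
#include <stddef.h>

/* Copy n ints in 32-byte blocks; src and dst are assumed 32-byte aligned,
   n a multiple of 8. With regular stores the destination lines are pulled
   into the cache and written back later; with streaming stores they bypass
   the cache and go to memory through write-combining buffers. */
void copy_regular(const int *src, int *dst, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256i v = _mm256_load_si256((const __m256i *)(src + i));
        _mm256_store_si256((__m256i *)(dst + i), v);   /* goes through the cache */
    }
}

void copy_streaming(const int *src, int *dst, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256i v = _mm256_load_si256((const __m256i *)(src + i));
        _mm256_stream_si256((__m256i *)(dst + i), v);  /* non-temporal: bypasses the cache */
    }
    _mm_sfence();  /* streaming stores are weakly ordered; fence before the data is used elsewhere */
}

If the destination is read again soon afterwards, or the working set already fits in the cache, the second version will not be faster and can even be slower, which matches the points above.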
A blog entry with additional info and a benchmark can be found e.g. here.

Related

Is the time it takes to retrieve data from disk to RAM linear?

Let's suppose that it takes X seconds to retrieve 100MB from disk to RAM in one retrieval. Does that mean it will also take X seconds to retrieve the same 100MB when it is divided into 100MB/N per retrieval over N retrievals, with each retrieval taking roughly X/N seconds?
It depends on various factors such as the storage device, the file system, the operating system and the retrieval method.
In general, when you perform multiple small retrievals, the time it takes to retrieve the data may be longer than if you perform a single large retrieval, due to the overhead of performing multiple operations. This is because the storage device may have to seek to different locations on the disk, which takes time, and the operating system may have to manage multiple requests.
Additionally, if you're using a hard disk drive (HDD) as the storage device, the time it takes to retrieve the data may be longer because hard drives are slower than solid-state drives (SSD) when it comes to random read operations. However, if you're using an SSD, the difference in retrieval time between multiple small retrievals and a single large retrieval may be less significant.
In the end, it's hard to give a general answer without more information about the specific storage device, file system and operating system you're using. But, in general, you should expect that multiple small retrievals might take longer than a single large retrieval.
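If you want to measure this rather than reason about it, a rough sketch along the following lines (the file name "data.bin" and the 100 MB size are placeholders; note that the OS page cache will make the second pass look faster unless you drop caches or use separate files) times one large read against many smaller ones:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Read buf_size bytes from path in chunk_size pieces and return elapsed seconds. */
static double read_in_chunks(const char *path, size_t chunk_size, char *buf, size_t buf_size)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); exit(1); }
    size_t off = 0;
    while (off < buf_size) {
        size_t want = chunk_size < buf_size - off ? chunk_size : buf_size - off;
        size_t got = fread(buf + off, 1, want, f);
        if (got == 0) break;
        off += got;
    }
    fclose(f);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    const size_t size = 100u * 1024 * 1024;  /* 100 MB test file, adjust to taste */
    char *buf = malloc(size);
    if (!buf) return 1;
    printf("1 x 100 MB : %.3f s\n", read_in_chunks("data.bin", size, buf, size));
    printf("100 x 1 MB : %.3f s\n", read_in_chunks("data.bin", 1024 * 1024, buf, size));
    free(buf);
    return 0;
}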

Why do you use `stream` for GRPC/protobuf file transfers?

I've seen a couple examples like this:
service Service {
  rpc upload(stream Data) returns (google.protobuf.Empty) {}
  rpc download(google.protobuf.Empty) returns (stream Data) {}
}

message Data { bytes bytes = 1; }
What is the purpose of using stream? Does it make the transfer more efficient?
In theory, yes, I obviously want to stream my file transfers, but that's what happens over a connection anyway... So, what is the actual benefit of this keyword? Does it enforce some form of special buffering to reduce overhead? Either way, the data is transmitted in full!
It's more efficient because, within a single call, multiple messages may be sent.
This avoids not only re-establishing another (hopefully TLS, i.e. even more work) connection with the server, but also spinning up client and server "stubs" again; both the client and server stay ready for more messages.
It's somewhat similar to being connected on a telephone call with your friend who, before hanging up, says "Oh, another thing...". Instead of hanging up the call and then, 5 minutes later, calling you back, interrupting dinner and causing you to pause a movie.
The answer is very similar to the gRPC + Image Upload question, although from a different perspective.
Doing a large download (10+ MB) as a single response message puts strong limits on the size of that download, as the entire response message is sent and processed at once. For most use cases, it is much better to chunk a 100 MB file into 1-10 MB chunks than require all 100 MB to be in memory at once. That also allows the downloader to begin processing the file before the entire file is acquired which reduces processing latency.
Without streaming, chunking would require multiple RPCs, which are annoying to coordinate and have performance complications. Because there is latency to complete RPCs, for reasonable performance you either have to do many RPCs in parallel (but how many?) or have a large batch size (but how big?). Multiple RPCs can also hit colder application caches, as each RPC goes to a different backend.
Using streaming provides the same throughput as the non-chunking approach without as many headaches of normal chunking approaches. Since streaming is pipelined (server can start sending next chunk as soon as previous chunk is sent) there's no added per-chunk latency between the client and server. This makes it much easier to choose a chunk size, as there is a wide range of "reasonable" sizes that will behave similarly and the system will naturally react as network performance varies.
While sending a message on an existing stream has less overhead than creating a new RPC, for many users the difference is negligible, and it is generally better to structure your RPCs in a way that is architecturally beneficial to your application rather than to eke out small optimizations in gRPC. The reason to use the stream in this case is to make your application perform better at lower complexity.

How does the CPU decide which data it puts in what memory (RAM, cache, registers)?

When the CPU is executing a program, does it move all data through the memory pipeline? Then any piece of data would be moved RAM -> cache -> registers, so all data that is processed ends up in the CPU registers at some point. Or does it somehow select which data it puts in those faster memory types, or can you as a programmer select specific data you want to keep in, for example, the cache for optimization?
The answer to this question is an entire course in itself! A very brief summary of what (usually) happens is that:
You, the programmer, specify what goes in RAM. Well, the compiler does it on your behalf, but you're in control of this by how you declare your variables.
Whenever your code accesses a variable, the CPU will check whether the value is already in the cache; if it is not, it will fetch the 'line' that contains the variable from RAM into the cache. Some CPU instruction sets let you influence or bypass this behaviour for specific low-frequency operations (prefetch, flush and non-temporal hints), but it requires very low-level code to do so; there is a small sketch after this answer showing the kind of intrinsics involved. When you update a value, the modified cache line is committed back to RAM, immediately if the cache is write-through, or later when the line is evicted if it is write-back. Again, you can affect how and when this happens with low-level code, and it also depends on the cache configuration.
If you are going to do any kind of operation on the value that requires the ALU (Arithmetic Logic Unit) or a similar unit, it will be loaded from the cache into an appropriate register. Which register is used depends on the instructions the compiler generated.
Some CPUs support Direct Memory Access (DMA), which provides a shortcut for operations that do not really require the CPU to be involved. These include memory-to-memory copies and transfers between memory and memory-mapped peripheral control blocks (such as UARTs and other I/O blocks). These cause data to be moved, read or written in RAM without involving the CPU core at all.
At a higher level, some operating systems that support multiple processes will save the RAM allocated to the current process to the hard disk when the process is swapped out, and load it back in again from the disk when the process runs again. (This is why you may find 'Page Files' on your C: drive and the options to limit their size.) This allows all of the running processes to utilise most of the available RAM, even though they can't actually share it all simultaneously. Paging is yet another subject worthy of a course on its own. (Thanks to Leeor for mentioning this.)
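To make the cache and register points above concrete, here is a small sketch (x86 assumed; the prefetch distance of 16 elements is an arbitrary illustration): the array lives in RAM, each access pulls its cache line in on demand, the running total stays in a register, and intrinsics such as _mm_prefetch and _mm_clflush are examples of the low-level cache control mentioned above.

#include <immintrin.h>   /* _mm_prefetch, _mm_clflush (x86 only) */
#include <stddef.h>

long sum(const int *data, size_t n)
{
    long total = 0;                      /* the compiler keeps this in a register */
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)                  /* hint: pull a line in ~16 elements ahead of use */
            _mm_prefetch((const char *)&data[i + 16], _MM_HINT_T0);
        total += data[i];                /* fetched RAM -> cache line -> register */
    }
    return total;
}

void done_with(const void *p)
{
    _mm_clflush(p);                      /* explicitly evict the cache line holding *p */
}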

Data Retrieval Throughput - ETS lookup vs inter-process Messaging

Suppose we have an Erlang application which involves thousands of processes. Suppose there is a single resource X, which may be a tuple, a list, or any Erlang term, which all these processes may need to read from, or pick something out of, at any moment in time.
An example of such a scenario is, say, an API system in which client processes need to read from and write to a remote machine. And it happens that you do not want a new connection to be created for each read/write request. So what you do is create a pool of connections and consider them a pool of open pipes/sockets/channels.
Now, this pool of resources is to be shared by thousands of processes such that for each read or write demand, you want the process to retrieve any available open channel/resource.
The question is: what if I have a single process hold this information, whether in its process dictionary or in its receive loop? It would mean that all the processes would have to send a message to this process whenever they need a free resource. This single process would have a huge mailbox at any time because of the high demand for this single resource. Or I could use an ETS table and have only one row, say #resources{key=pool, value=List_of_openSockets_or_channels}. But this would mean that all our processes would attempt to read the same row from the ETS table at (with high probability) the same instant.
How would the ETS table handle 10,000 processes attempting a read of the same row/record at the same time, or almost the same time? And conversely, if I use a process and its mailbox, what happens when 10,000 processes send it a message for the same resource at the same time (and it has to reply to each requester)? Remember this may occur very frequently. Which option (disregarding availability issues such as the process going down) would provide higher throughput, so that processes get what they need faster? Is there any other, better way of handling high-demand data structures in the Erlang VM that provides very fast access for millions of processes, even if they all need that resource at the same time?
Short answer: profile. Try different approaches and verify how your system behaves.
Firstly, I would look at ETS' {read_concurrency, true} option. From the documentation:
{read_concurrency,boolean()} Performance tuning. Default is false.
When set to true, the table is optimized for concurrent read
operations. When this option is enabled on a runtime system with SMP
support, read operations become much cheaper; especially on systems
with multiple physical processors. However, switching between read and
write operations becomes more expensive. You typically want to enable
this option when concurrent read operations are much more frequent
than write operations, or when concurrent reads and writes comes in
large read and write bursts (i.e., lots of reads not interrupted by
writes, and lots of writes not interrupted by reads). You typically do
not want to enable this option when the common access pattern is a few
read operations interleaved with a few write operations repeatedly. In
this case you will get a performance degradation by enabling this
option. The read_concurrency option can be combined with the
write_concurrency option. You typically want to combine these when
large concurrent read bursts and large concurrent write bursts are
common.
Secondly, I would look at caching possibilities. Are the processes reading that information only once or multiple times? If they're accessing it multiple times, you could read it once and store it in your process state.
Thirdly, you could try to replicate and distribute that piece of information across your system. Divide et impera.
If you use the process approach, in order to avoid having all the read requests serialized on the message queue of the 'server' process you must replicate.
Using an ETS table with read_concurrency feels more natural and it is something that I used when developing the parallel version of Dialyzer. However, ETS access was never a bottleneck in that case.

How does shared memory vs message passing handle large data structures?

In looking at Go and Erlang's approach to concurrency, I noticed that they both rely on message passing.
This approach obviously alleviates the need for complex locks because there is no shared state.
However, consider the case of many clients wanting parallel read-only access to a single large data structure in memory -- like a suffix array.
My questions:
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?
One thing to realise is that the Erlang concurrency model does NOT really specify that the data in messages must be copied between processes, it states that sending messages is the only way to communicate and that there is no shared state. As all data is immutable, which is fundamental, then an implementation may very well not copy the data but just send a reference to it. Or may use a combination of both methods. As always, there is no best solution and there are trade-offs to be made when choosing how to do it.
The BEAM uses copying, except for large binaries where it sends a reference.
Yes, shared state could be faster in this case. But only if you can forgo the locks, and that is only doable if it's absolutely read-only. If it's 'mostly read-only' then you need a lock (unless you manage to write lock-free structures, and be warned that they're even trickier than locks), and then you'd be hard-pressed to make it perform as fast as a good message-passing architecture.
Yes, you could write a 'server process' to share it. With really lightweight processes, it's no heavier than writing a small API to access the data. Think of it as an object (in the OOP sense) that 'owns' the data. Splitting the data into chunks to enhance parallelism (called 'sharding' in DB circles) helps in big cases (or if the data is on slow storage).
Even if NUMA is getting mainstream, you still have more and more cores per NUMA cell. And a big difference is that a message can be passed between just two cores, while a lock has to be flushed from the cache on ALL cores, limiting it to the inter-cell bus latency (even slower than RAM access). If anything, shared state/locks are getting less and less feasible.
In short: get used to message passing and server processes, it's all the rage.
Edit: revisiting this answer, I want to add a note about a phrase found in Go's documentation:
share memory by communicating, don't communicate by sharing memory.
The idea is: when you have a block of memory shared between threads, the typical way to avoid concurrent access is to use a lock to arbitrate. The Go style is to pass a message with the reference; a thread only accesses the memory when it receives the message. It relies on some measure of programmer discipline, but results in very clean-looking code that can be easily proofread, so it's relatively easy to debug.
The advantage is that you don't have to copy big blocks of data on every message, and don't have to effectively flush caches as with some lock implementations. It's still somewhat early to say whether the style leads to higher-performance designs or not (especially since the current Go runtime is somewhat naive about thread scheduling).
In Erlang, all values are immutable - so there's no need to copy a message when it's sent between processes, as it cannot be modified anyway.
In Go, message passing is by convention; there's nothing to prevent you from sending someone a pointer over a channel and then modifying the data it points to, only convention, so once again there's no need to copy the message.
Most modern processors use variants of the MESI protocol. Because of its Shared state, passing read-only data between different threads is very cheap. Modified shared data is very expensive, though, because all other caches that store this cache line must invalidate it.
So if you have read-only data, it is very cheap to share it between threads instead of copying it with messages (there is a small sketch after this answer). If you have read-mostly data, it can be expensive to share between threads, partly because of the need to synchronize access, and partly because writes destroy the cache-friendly behavior of the shared data.
Immutable data structures can be beneficial here. Instead of changing the actual data structure, you simply make a new one that shares most of the old data, but with the things changed that you need changed. Sharing a single version of it is cheap, since all the data is immutable, but you can still update to a new version efficiently.
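As a rough illustration of the read-only case above (a hedged sketch; the thread count, array size and contents are arbitrary), several threads can scan the same array concurrently with no locks and no copies, since every core simply keeps the relevant lines in the Shared state:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N_THREADS 4
#define N_ITEMS   (1 << 20)

static int shared_data[N_ITEMS];          /* written once, then read-only */

static void *reader(void *arg)
{
    long id = (long)arg, total = 0;
    /* every thread reads the whole array: no locks, no copies; the cache
       lines just end up Shared in each core's cache */
    for (size_t i = 0; i < N_ITEMS; i++)
        total += shared_data[i];
    printf("thread %ld: sum = %ld\n", id, total);
    return NULL;
}

int main(void)
{
    for (size_t i = 0; i < N_ITEMS; i++)  /* initialize before the readers start */
        shared_data[i] = (int)(i & 0xff);

    pthread_t t[N_THREADS];
    for (long i = 0; i < N_THREADS; i++)
        pthread_create(&t[i], NULL, reader, (void *)i);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}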
What is a large data structure?
One person's large is another person's small.
Last week I talked to two people. One person was making embedded devices and used the word "large"; I asked him what it meant, and he said over 256 KBytes. Later in the same week a guy was talking about media distribution; he used the word "large" too, and when I asked him what he meant he thought for a bit and said "won't fit on one machine", say 20-100 TBytes.
In Erlang terms "large" could mean "won't fit into RAM". So with 4 GBytes of RAM, data structures over 100 MBytes might be considered large, and copying a 500 MBytes data structure might be a problem. Copying small data structures (say under 10 MBytes) is never a problem in Erlang.
Really large data structures (i.e. ones that won't fit on one machine) have to be copied and "striped" over several machines.
So I guess you have the following:
Small data structures are no problem: since they are small, processing is fast, copying is fast and so on (just because they are small).
Big data structures are a problem: they don't fit on one machine, so copying is essential.
Note that your questions are technically non-sensical because message passing can use shared state so I shall assume that you mean message passing with deep copying to avoid shared state (as Erlang currently does).
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
Using shared state will be a lot faster.
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Either approach can be used.
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?
Copying is cache unfriendly and, therefore, destroys scalability on multicores because it worsens contention for the shared resource that is main memory.
Ultimately, Erlang-style message passing is designed for concurrent programming whereas your questions about throughput performance are really aimed at parallel programming. These are two quite different subjects and the overlap between them is tiny in practice. Specifically, latency is typically just as important as throughput in the context of concurrent programming and Erlang-style message passing is a great way to achieve desirable latency profiles (i.e. consistently low latencies). The problem with shared memory then is not so much synchronization among readers and writers but low-latency memory management.
One solution that has not been presented here is master-slave replication. If you have a large data structure, you can replicate changes to it out to all slaves, which perform the update on their own copy.
This is especially interesting if you want to scale to several machines that don't even have the possibility to share memory without very artificial setups (mmap of a block device that reads/writes from a remote computer's memory?).
A variant of this is to have a transaction manager that you ask nicely to update the replicated data structure, and which makes sure that it serves one and only one update request at a time. This is more like the mnesia model for master-master replication of mnesia table data, which qualifies as a "large data structure".
The problem at the moment is indeed that the locking and cache-line coherency might be as expensive as copying a simpler data structure (e.g. a few hundred bytes).
Most of the time a cleverly written new multi-threaded algorithm that tries to eliminate most of the locking will be faster, and a lot faster with modern lock-free data structures, especially when you have well-designed cache systems like Sun's Niagara chip-level multi-threading.
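As a tiny illustration of "eliminating most of the locking" (a hedged sketch of a shared counter, not a real lock-free data structure, which is much harder to get right), C11 atomics remove the lock entirely but still pay for the cache-line coherency traffic mentioned above:

#include <stdatomic.h>
#include <pthread.h>

/* Mutex version: every increment takes and releases a lock. */
static long            counter_locked;
static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;

void increment_locked(void)
{
    pthread_mutex_lock(&counter_mutex);
    counter_locked++;
    pthread_mutex_unlock(&counter_mutex);
}

/* Atomic version: no lock, but the cache line holding the counter still
   bounces between cores, so it is cheaper, not free. */
static atomic_long counter_atomic;

void increment_atomic(void)
{
    atomic_fetch_add_explicit(&counter_atomic, 1, memory_order_relaxed);
}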
If your system/problem is not easily broken down into a few simple data accesses, then you have a problem, and not all problems can be solved by message passing. This is why there are still some Itanium-based supercomputers sold: they have terabytes of shared RAM and up to 128 CPUs working on the same shared memory. They are an order of magnitude more expensive than a mainstream x86 cluster with the same CPU power, but you don't need to break down your data.
Another reason not mentioned so far is that programs can become much easier to write and maintain when you use multi-threading. Message passing and the shared-nothing approach make them even more maintainable.
As an example, Erlang was never designed to make things faster but to use a large number of threads to structure complex data and event flows.
I guess this was one of the main points in the design. In the web world of Google you usually don't care about performance, as long as it can run in parallel in the cloud. And with message passing you ideally can just add more computers without changing the source code.
Usually message-passing languages (this is especially easy in Erlang, since it has immutable variables) optimise away the actual data copying between processes (local processes only, of course: you'll want to think about your network distribution pattern wisely), so this isn't much of an issue.
The other concurrency paradigm is STM, software transactional memory. Clojure's refs are getting a lot of attention. Tim Bray has a good series exploring Erlang's and Clojure's concurrency mechanisms:
http://www.tbray.org/ongoing/When/200x/2009/09/27/Concur-dot-next
http://www.tbray.org/ongoing/When/200x/2009/12/01/Clojure-Theses
