Why does rep movsb use data segments?

If using the old segment registers is outdated, why do they still show up today when I have something like this:
rep movsb %ds:(%rsi),%es:(%rdi)
What are %ds and %es doing here? Wouldn't it be the same without the segments?

You don't need to specify them, but they are still "there." They are not being used as "segment registers" in the old sense, however; they are being used as selectors.
The segment registers are now used as selectors into the Global Descriptor Table (or possibly a Local Descriptor Table), which defines memory regions and their read/write permissions.
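For reference, a 16-bit selector value packs a descriptor-table index together with two small fields. A minimal C sketch of how one could decode it (the function name is mine, not from any API):

#include <stdint.h>
#include <stdio.h>

/* An x86 segment selector: bits 0-1 hold the requested privilege
   level (RPL), bit 2 selects GDT (0) or LDT (1), and bits 3-15
   index into that descriptor table. */
static void decode_selector(uint16_t sel)
{
    printf("index=%u table=%s rpl=%u\n",
           (unsigned)(sel >> 3),         /* descriptor-table index */
           (sel & 0x4) ? "LDT" : "GDT",  /* table-indicator bit */
           (unsigned)(sel & 0x3));       /* requested privilege level */
}

int main(void)
{
    decode_selector(0x2b); /* e.g. Linux's user-mode data selector */
    return 0;
}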


Forth implementation with JIT write protection?

I believe Apple has disallowed mapping memory as both writable and executable at the same time on the ARM64 architecture; see mmap() RWX page on MacOS (ARM64 architecture)?
This makes it difficult to port implementations like jonesforth, which keeps generated code and the code to generate it (like the built-in assembler in jonesforth.f) in the same segment.
I thought I could do something like map the user space from the start to HERE as 'r-x', and from HERE to the end as 'rw-'. Then I'd have to constantly remap memory as I compile new words, and I couldn't go back and fix up previous words (I believe SCODE would make use of that).
Do you have any advice on how to handle such limitations?
I guess I should look into other forth implementations that are running on M1 Macs.
A Forth implementation can have a problem with write-protected code segments only when it generates machine code that should be executable at once. There is no such problem if it uses threaded code. So it is assumed below that the Forth system has to generate machine code.
Data space and code space
Obviously, you have to separate code space from data space. Data space (at least its mutable regions, including regions for variables and data fields), as well as internal mutable memory regions and probably headers, should be mapped to 'rw-' segments. Code space should be mapped to 'r-x' segments.
The word here ( -- addr ) returns the address of the first cell available for reservation, which is writable for a program, and it should always be in an 'rw-' segment. You can have an internal word code::here ( -- addr ) that returns an address in code space, if you need one.
The choice of representation for execution tokens is a compromise between speed and simplicity of implementation (an 'r-x' segment vs 'rw-'). The simplest case is that an execution token is represented by an address in an 'rw-' segment, and then execute does an additional dereference to get the corresponding address of code.
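A minimal C sketch of that indirection (the names are mine):

typedef void (*code_ptr)(void);

/* The execution token is an address in an 'rw-' segment that holds a
   pointer to the actual machine code in an 'r-x' segment. Recompiling
   a word only patches the 'rw-' cell; the code pages stay read-only. */
void execute(code_ptr *xt)
{
    (*xt)();  /* one extra dereference, then the call */
}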
Code generation
In the given conditions we should generate machine code into an 'rw-' segment, but before this code is executed, the segment should be made 'r-x'.
Probably the simplest solution is to allocate a memory block for every new definition, resize (shrink) the block on completion, and make it 'r-x'. Possible disadvantages: waste due to page-size granularity (e.g. 4 KiB), and maybe memory fragmentation.
Changing protection of the main code segment starting from code::here also implies losses due to page size granularity.
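A minimal POSIX sketch of the per-definition variant (on macOS/ARM64 you would additionally need MAP_JIT, pthread_jit_write_protect_np() and an instruction-cache flush, all omitted here):

#include <string.h>
#include <sys/mman.h>

typedef void (*code_fn)(void);

/* Copy freshly generated machine code into its own block and seal it. */
code_fn seal_definition(const unsigned char *code, size_t len)
{
    /* 1. Allocate a fresh writable block for the new definition. */
    void *blk = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (blk == MAP_FAILED)
        return NULL;

    /* 2. Fill it while the mapping is still 'rw-'. */
    memcpy(blk, code, len);

    /* 3. Flip the whole block to 'r-x' before it is ever executed. */
    if (mprotect(blk, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(blk, len);
        return NULL;
    }
    return (code_fn)blk;
}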
Another variant is to break the creation of a definition into two stages:
1. generate an intermediate representation (IR) in a separate 'rw-' segment while compiling the definition;
2. when the definition is completed, generate machine code in the main code segment from the IR, and discard the IR.
Actually, it could be machine code in the first stage too, and then it is just relocated to another place in the second stage.
Before writing to the main code segment you change it (or the relevant part of it) to 'rw-', and afterwards revert it to 'r-x'.
The subroutine that translates the IR should reside in a separate 'r-x' segment that you never change.
Forth is agnostic to the format of the generated code, and in a straightforward system only a handful of definitions "know" what format is generated. So only these definitions have to be changed to generate IR. If you relocate machine code instead, you probably don't need to change even these definitions.

Creating temporary global memory variables inside kernels [duplicate]

This question already has answers here: How to dynamically allocate arrays inside a kernel? (5 answers). Closed 2 years ago.
As per my knowledge, atomicAdd can be used on shared memory and global memory. I need to atomically add floating point numbers from threads of different blocks; hence, I need to use a global temporary to hold the sum.
Is there a way to allocate temporary globals from inside a kernel?
Currently, I allocate a temporary global and pass a pointer to my kernel. This doesn't appear to be very user-friendly.
TL;DR: I need a temporary variable for atomic addition across different blocks, without having to explicitly allocate a global and pass a pointer to it into the kernel.
You can use malloc() inside kernel code. However, it's rarely a good idea to do so. It's usually much better to pre-allocate scratch space before the kernel is launched, pass it as an argument, and let each thread, or group of threads, have some formula for determining the location they will use for their common atomics within that scratch area.
Now, you've written this isn't very "user-friendly"; I guess you mean developer-friendly. Well, it can be made more friendly! For example, my CUDA Modern C++ API wrappers library offers an equivalent of std::unique_ptr - but for device memory:
#include <cuda/api_wrappers.hpp>

// ... etc. etc. ...
{
    auto scratch = cuda::memory::device::make_unique<float[]>(1024, my_cuda_device_id);
    my_kernel<<<blah, blah, blah>>>(output, input, scratch.get());
} // the device memory is released here!
(This is for synchronous launches, of course.)
Something else you can do to be more developer-friendly is to use some kind of proxy function to get the location in that scratch memory relevant to a specific thread / warp / group of threads / whatever, which uses the same address for its atomics. That should at least hide away some of the repetitive, annoying address arithmetic your kernel might be using.
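A self-contained sketch of both ideas together, pre-allocated scratch plus a proxy; reduce_kernel and sum_slot are hypothetical names of mine, not library calls:

#include <cuda_runtime.h>

// Proxy that hides the scratch-address arithmetic; here every block
// shares slot 0, but a real formula might pick a slot per block group.
__device__ float *sum_slot(float *scratch)
{
    return &scratch[0];
}

__global__ void reduce_kernel(const float *in, int n, float *scratch)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(sum_slot(scratch), in[i]);  // cross-block atomic add
}

int main()
{
    float *d_in, *d_scratch;
    cudaMalloc(&d_in, 1024 * sizeof(float));
    cudaMemset(d_in, 0, 1024 * sizeof(float));
    cudaMalloc(&d_scratch, sizeof(float));    // pre-allocated scratch
    cudaMemset(d_scratch, 0, sizeof(float));
    reduce_kernel<<<4, 256>>>(d_in, 1024, d_scratch);
    cudaDeviceSynchronize();
    cudaFree(d_scratch);
    cudaFree(d_in);
    return 0;
}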
There's also the option of using global __device__ variables (like @RobertCrovella mentioned), but I wouldn't encourage that: the size would have to be fixed at compile time, you wouldn't be able to use it from two kernels at once without it being painful, etc.

OpenCL: When to use global, private, local, constant address spaces

I'm trying to learn OpenCL, but I'm having a hard time deciding which address spaces to use, since I only find scattered resources declaring what these address spaces are, but not why they exist or when to use them. With this question I hope to gather all of this information: what the address spaces are, why they exist, when to use which one, and what the advantages and disadvantages are regarding memory and performance.
As I understand it (which is probably too simplified), the GPU has two physical types of memory: global memory, far from the actual processors, so slow but pretty big and available to all workers, and local memory, close to the actual processors, so fast but small and not accessible from other workers.
Intuitively, the local qualifier makes sure a variable is placed in local memory and the global qualifier makes sure a variable is placed in global memory, though I'm not sure this is exactly what happens. This leaves the private and constant qualifiers. What's the purpose of those?
There are also some implicit qualifiers. For example, the specification mentions the generic address space, which is used for arguments with no qualifiers, I think. What does this do exactly? Then there are also local function variables. What's the address space for those?
Here is an example using my intuition, but without knowing what I'm actually doing:
Example:
Say I pass an array of type long and length 10000 to a kernel which I will only use to read, then I would declare it global const as it must be available to all workers and it will not change. Why wouldn't I use the constant qualifier? When setting the buffer for this array via the CPU, I actually also just could have made the array read-only, which in my eyes says the same as declaring it const. So again, when and why would I declare something constant or global const?
When performing memory-intensive tasks, would it be better to copy the array to a local array inside the kernel? My guess is that local memory would be too small, but what if the array only had a length of 10? When would the array be too big/small? More general: when is it worth copying data from global to local memory?
Say I also want to pass the length of this array, then I would add const int length to the arguments of my kernel, but I'm unsure why I would omit the global qualifier except because I have seen other people do it. After all, length must be accessible for all workers. If I'm right, then length would have a generic address space, but again, I don't really know what that means.
I hope someone with some experience can clear this up. That would be great not only for me, but I hope also for other enthusiasts who want to gain some practical knowledge concerning memory management on the GPU.
Constant: A small portion of cached global memory visible to all workers. Read-only; use it if you can.
Global: Slow, visible to all, read or write. It is where all your data ends up, so some accesses to it are always necessary.
Local: Do you need to share something within a local group? Use local! Do all your local workers access the same global memory? Use local!
Local memory is only visible inside a local group and is limited in size, but it is very fast.
Private: Memory that is only visible to a single worker; think of it as registers. All variables without an explicit qualifier are private by default. A toy kernel showing all four qualifiers follows below.
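This is a sketch of mine, not part of the original answer:

// 'out' is global (read/write, visible to all workers), 'coeffs' is
// constant (small, cached, read-only), 'tile' is local (shared within
// one work-group), and 'x' is private (per-worker, register-like).
__kernel void scale(__global float *out,
                    __constant float *coeffs,
                    __local float *tile)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    float x = out[gid];
    tile[lid] = x * coeffs[0];
    barrier(CLK_LOCAL_MEM_FENCE);
    out[gid] = tile[lid];
}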
Say I pass an array of type long and length 10000 to a kernel which I will only use to read, then I would declare it global const as it must be available to all workers and it will not change. Why wouldn't I use the constant qualifier?
Actually, yes: you can and should use the constant qualifier, which places your data in constant memory (a small portion of read-only memory quickly accessible by all workers). This is what GPUs use to transfer uniforms to all vertex shaders.
When setting the buffer for this array via the CPU, I actually also just could have made the array read-only, which in my eyes says the same as declaring it const. So again, when and why would I declare something constant or global const?
Not really: when you create a read-only buffer you are only telling OpenCL that you plan to use it read-only, so it can optimize behind the scenes, but you can actually still write to it from a kernel.
global const is just a safeguard for the developer so you don't accidentally write to the data; doing so will give an error at compile time.
Basically, it is the same as const in plain C host-side code: programs will also work fine if all memory is non-const.
When performing memory-intensive tasks, would it be better to copy the array to a local array inside the kernel? My guess is that local memory would be too small, but what if the array only had a length of 10? When would the array be too big/small? More general: when is it worth copying data from global to local memory?
It is only worth it if the data is read by many workers. If each worker reads a single distinct value from global memory, then it is not worth it (a staging sketch follows the examples below).
Useful here:
Worker0 -> Reads 0,1,2,3
Worker1 -> Reads 0,1,2,3
Worker2 -> Reads 0,1,2,3
Worker3 -> Reads 0,1,2,3
Not useful here:
Worker0 -> Reads 0
Worker1 -> Reads 1
Worker2 -> Reads 2
Worker3 -> Reads 3
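In the "useful" case, the usual pattern is to stage the shared values into local memory once per work-group, as in this sketch (names are mine):

__kernel void sum_shared(__global const float *in, __global float *out)
{
    __local float tile[4];
    int lid = get_local_id(0);

    // Each of the first four workers loads one value from global memory...
    if (lid < 4)
        tile[lid] = in[lid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // ...then every worker reads all four values from fast local memory.
    out[get_global_id(0)] = tile[0] + tile[1] + tile[2] + tile[3];
}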
Say I also want to pass the length of this array, then I would add const int length to the arguments of my kernel, but I'm unsure why I would omit the global qualifier except because I have seen other people do it. After all, length must be accessible for all workers. If I'm right, then length would have a generic address space, but again, I don't really know what that means.
When you don't specify a qualifier on a kernel parameter, it typically defaults to constant, which is what you want for small elements like this, so all workers have fast access.
The rule OpenCL compilers normally follow for kernel parameters is: if it is only read and fits in constant memory, use constant; otherwise use global.

boost lockfree spsc_queue cache memory access

I need to be extremely concerned with speed/latency in my current multi-threaded project.
Cache access is something I'm trying to understand better. And I'm not clear on how lock-free queues (such as the boost::lockfree::spsc_queue) access/use memory on a cache level.
I've seen queues used where a pointer to a large object that needs to be operated on by the consumer core is pushed into the queue.
If the consumer core pops an element from the queue, I presume that means the element (a pointer in this case) is already loaded into the consumer core's L2 and L1 caches. But to access the object it points to, doesn't it still need to find and load that object, either from the L3 cache or across the interconnect (if the other thread is on a different CPU socket)? If so, would it maybe be better to simply send a copy of the object that could be disposed of by the consumer?
Thank you.
C++ is principally a pay-for-what-you-need ecosystem.
Any regular queue will let you choose the storage semantics (by value or by reference).
However, this time you ordered something special: you ordered a lock free queue.
In order to be lock free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.
You might doubt whether it's even possible to have a value-type that exceeds the system's native register size (say, int64_t).
Good question.
Enter Ringbuffers
Indeed, any node-based container would just require pointer swaps for all modifying operations, and those are trivially made atomic on all modern architectures.
But does anything that involves copying multiple distinct memory areas, in non-atomic sequence, really pose an unsolvable problem?
No. Imagine a flat array of POD data items. Now, if you treat the array as a circular buffer, one would just have to maintain the indices of the buffer's front and end positions atomically. The container could, at leisure, update an internal 'dirty front index' while it copies ahead of the external front (the copy can use relaxed memory ordering). Only once the whole copy is known to have completed is the external front index updated. This update needs to be in acq_rel/cst memory order[1] (see the sketch after the list below).
As long as the container is able to guard the invariant that the front never fully wraps around and reaches the back, this is a sweet deal. I think this idea was popularized by the Disruptor library (of LMAX fame). You get mechanical sympathy from:
linear memory access patterns while reading/writing
even better if you can make the record size a multiple of the physical cache-line size
all the data is local unless the POD contains raw references outside that record
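The publish step described above can be sketched in portable C++ (an illustration of the idea, not Boost's actual code):

#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>
struct SpscRing {
    T buf[N];
    std::atomic<std::size_t> front{0}, back{0};

    bool push(const T &v)
    {
        std::size_t b    = back.load(std::memory_order_relaxed);
        std::size_t next = (b + 1) % N;
        if (next == front.load(std::memory_order_acquire))
            return false;      // full: never let the back wrap onto the front
        buf[b] = v;            // plain, non-atomic copy into the slot
        back.store(next, std::memory_order_release); // publish after the copy
        return true;
    }
};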
How Does Boost's spsc_queue Actually Do This?
Yes, spsc_queue stores the raw element values in a contiguous, aligned block of memory (e.g. in compile_time_sized_ringbuffer, which underlies spsc_queue when the maximum capacity is supplied at compile time):
typedef typename boost::aligned_storage<max_size * sizeof(T),
                                        boost::alignment_of<T>::value
                                       >::type storage_type;
storage_type storage_;

T * data()
{
    return static_cast<T*>(storage_.address());
}
(The element type T need not even be POD, but it needs to be both default-constructible and copyable).
Yes, the read and write pointers are atomic integral values. Note that the boost devs have taken care to apply enough padding to avoid False Sharing on the cache line for the reading/writing indices: (from ringbuffer_base):
static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
atomic<size_t> write_index_;
char padding1[padding_size]; /* force read_index and write_index to different cache lines */
atomic<size_t> read_index_;
In fact, as you can see, there is only the one "internal" index on either the read or the write side. This is possible because there is only one writing thread and also only one reading thread, which means that there could only be more free space at the end of a write operation than anticipated.
Several other optimizations are present:
branch prediction hints for platforms that support it (unlikely())
it's possible to push/pop a range of elements at once. This should improve throughput in case you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cacheline
use of std::uninitialized_copy where possible
the calls to trivial constructors/destructors will be optimized out at instantiation time
the uninitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports them)
All in all, we see a best-in-class design for a ring buffer.
What To Use
Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.
On the other hand, storing the complete message type in the element type could become costly if copying is expensive. At the very least, try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).
So in practice you might consider storing frequently used data right there in the value, and referencing the less-often-used data via a pointer (the cost of the pointer is low unless it's traversed).
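For illustration, a sketch of that hybrid layout (the Message and Payload types here are hypothetical):

#include <boost/lockfree/spsc_queue.hpp>
#include <cstdint>

struct Payload;                    // large, cold data, allocated elsewhere

// Hot fields stored inline, cold data behind a pointer; the whole
// element is sized and aligned to one cache line.
struct alignas(64) Message {
    std::uint64_t  sequence;
    double         price;
    std::uint32_t  quantity;
    const Payload *detail;         // dereferenced only when actually needed
};

boost::lockfree::spsc_queue<Message, boost::lockfree::capacity<1024>> queue;

// Producer thread: queue.push(msg);
// Consumer thread: Message m; while (queue.pop(m)) { /* use m */ }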
If you need that "attachment" model, consider using a custom allocator for the referred-to data so you can achieve memory access patterns there too.
Let your profiler guide you.
[1] I suppose for SPSC acq_rel should work, but I'm a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself. I recommend anyone else follow my example :)

Why is a process's address space divided into four segments (text, data, stack and heap)?

Why does a process's address space have to be divided into four segments (text, data, stack and heap)? What is the advantage? Is it possible to have just one big segment?
There are multiple reasons for splitting programs into parts in memory.
One of them is that instruction and data memories can be architecturally distinct and discontiguous, that is, read and written from/to using different instructions and circuitry inside and outside of the CPU, forming two different address spaces (i.e. reading code from address 0 and reading data from address 0 will typically return two different values, from different memories).
Another is reliability/security. You rarely want the program's code and constant data to change. Most of the time when that happens, it happens because something is wrong (either in the program itself or in its inputs, which may be maliciously constructed). You want to prevent that from happening and know if there are any attempts. Likewise you don't want the data areas that can change to be executable. If they are and there are security bugs in the program, the program can be easily forced to do something harmful when malicious code makes it into the program data areas as data and triggers those security bugs (e.g. buffer overflows).
Yet another is storage... In many programs a number of data areas aren't initialized at all or are initialized to one common predefined value (often 0). Memory has to be reserved for these data areas when the program is loaded and is about to start, but these areas don't need to be stored on the disk, because there's no meaningful data there.
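To make the split concrete, a small C program can show where each kind of object lands (exact addresses and segment names vary per platform):

#include <stdio.h>
#include <stdlib.h>

int global_init = 42;  /* data segment: initialized, stored in the binary */
int global_zero;       /* bss: zero-initialized, not stored on disk */

int main(void)
{
    int  local = 0;                     /* stack */
    int *heap  = malloc(sizeof *heap);  /* heap */
    printf("text : %p\n", (void *)main);
    printf("data : %p\n", (void *)&global_init);
    printf("bss  : %p\n", (void *)&global_zero);
    printf("heap : %p\n", (void *)heap);
    printf("stack: %p\n", (void *)&local);
    free(heap);
    return 0;
}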
On some systems you may have everything in one place (section/segment/etc). One notable example here is MSDOS, where .COM-style programs have no structure other than that they have to be less than about 64KB in size and the first executable instruction must appear at the very beginning of file and assume that its location corresponds to IP=0x100 (where IP is the instruction pointer register). How code and data are placed and interleaved in a .COM program is unimportant and up to the programmer.
There are other architectural artifacts such as x86 segments. Again, MSDOS is a good example of an OS that deals with them. .EXE-style programs in it may have multiple segments in them that correspond directly to the x86 CPU segments, to the real-mode addressing scheme, in which memory is viewed through 64KB-long "windows" known as segments. The position of these windows/segments is relative to the value of the CPU's segment registers. By altering the segment register values you can move the "windows". In order to access more than 64KB one needs to use different segment register values and that often implies having multiple segments in the .EXE (can be not just one segment for code and one for data, but also multiple segments for either of them).
At least the text and data segments are separated to prevent malicious code that's stored inside a variable from being run.
Instructions (compiled code) are stored in the text segment, while the contents of your variables are stored in a data segment, the latter of which never gets executed, only read from and written to.
Isn't this distinction just a big, hacky workaround for patching security into the von Neumann architecture, where data and instructions share the same memory?
