How do I sample all threads and record their thread id with perf?

I want to get information about what each thread is doing at regular intervals. Unfortunately I can see in the perf script output that no thread id is recorded, because the output looks the same with -F +tid as with -F -tid.
I tried using the --per-thread option but it doesn't do what I want. Instead, it seems to drop the timestamp field from the data.
Is this possible? If not, what does the data reported by perf for a multithreaded program mean – does it just sample the main thread, or random threads?

As it turns out, perf record already records threads and their IDs. What confused me is that the thread ID of the main thread is equal to the process ID. I must also have been doing something wrong in my -F -tid test, because the thread ID column does indeed disappear.


Getting Null Pointer exception while trying to get value from option in Apache beam

I am using Java 8 and Apache Beam 2.19.0 to run some Dataflow jobs. As per my requirements, I set an option value dynamically in code as follows:
option.setDay(ValueProvider.StaticValueProvider.of(sDay))
I then read this option in another transformation in the same Dataflow pipeline. When I run with small data it works fine and I can read options.getDay().get(), but for large data, such as 5 million lines across different files, it throws a NullPointerException at options.getDay().get().
Adding more example points to this question for better understanding:
- If I read 1 million lines, it executes well.
- If I read 2 million lines, it executes well but logs: "Throttling logger worker. It used up its 30s quota for logs in only 25.107s".
- If I read more than 2 million lines, it logs the same throttling message and throws the NullPointerException at options.getDay().get().
If I understood correctly, it looks like you're trying to call setDay on every element in the stream. I suspect that one element is calling set while another element is trying to get or set in parallel, which causes the NullPointerException.
To fix this, pass sDay on the element itself as another property, instead of modifying the options.

Where are NVMe commands located inside the PCIe BAR?

According to the NVMe specification, the BAR has tail and head fields for each queue. For example:
Submission Queue y Tail Doorbell (SQyTDBL):
Start: 1000h + (2y * (4 << CAP.DSTRD))
End: 1003h + (2y * (4 << CAP.DSTRD))
Submission Queue y Head Doorbell (SQyHDBL):
Start: 1000h + ((2y + 1) * (4 << CAP.DSTRD))
End: 1003h + ((2y + 1) * (4 << CAP.DSTRD))
Are these the queues themselves, or just pointers to them? If they are the queues, I would assume DSTRD indicates the maximum length of all queues.
Moreover, the specification talks about two optional regions: Host Memory Buffer (HMB) and Controller Memory Buffer (CMB).
HMB: a region within the host's DRAM (PCIe root)
CMB: a region within the NVMe controller's DRAM (inside the SSD)
If both are optional, where are the queues located then? Since an endpoint PCIe device only exposes BARs and PCI headers, I don't see any place they could be located other than a BAR.
Sorry, I am doing this from memory, but I have implemented an FPGA NVMe host, so hopefully my memory is enough to answer your questions and more; if I get something wrong, at least you know why. I'll provide reference sections from the specification, which you can find here: https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf. Also, as a note before I really answer your question, I want to clear up some confusion: understanding the spec takes time, and I honestly recommend reading it bottom to top; the last few sections give context for the first few, as strange as that sounds.
These are the doorbell registers for the submission and completion queues, specifically the submission queue tail and the completion queue head respectively (Section 3.1). More on this later; for now I just want to correct the misconception that you, as the host, access the submission queue head. You do not; only the controller (traditionally, the drive) does. A simple reminder: a submission is you asking the drive to do something; a completion is the drive telling you how it went. Read Section 7.2 for more info.
Before you can send anything to these queues you must first setup said queues. Baseline in the system these queues do not exist, you must use the admin queue to set them up.
28h-2Fh: ASQ (Admin Submission Queue Base Address)
30h-37h: ACQ (Admin Completion Queue Base Address)
Your statement about DSTRD is a big misunderstanding. This field comes from the capabilities register (offset 0x0), Figure 3.1.1. It is the controller (drive) telling you the "doorbell stride", i.e. how many bytes sit between consecutive doorbells. I've never seen a drive report anything but 0 for this value, since why would you want dead space between doorbell registers?
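As a sketch of that stride math, the doorbell offsets from the formulas quoted in the question can be computed in plain C (function names are mine, not from the spec; the (2y+1) register is the completion queue head doorbell):

```c
#include <assert.h>
#include <stdint.h>

/* Doorbell offsets inside BAR0 for queue y (NVMe 1.4, Section 3.1).
 * dstrd is the CAP.DSTRD field: bytes between doorbells = 4 << dstrd.
 * Function names are illustrative, not from the spec. */
static uint64_t sq_tail_doorbell(unsigned y, unsigned dstrd) {
    return 0x1000u + (2u * y) * (4u << dstrd);      /* SQyTDBL */
}

static uint64_t cq_head_doorbell(unsigned y, unsigned dstrd) {
    return 0x1000u + (2u * y + 1u) * (4u << dstrd); /* CQyHDBL */
}
```

With DSTRD = 0 this gives 0x1000/0x1004 for the admin queue (queue 0) and 0x1008/0x100C for I/O queue 1, which are the register addresses used later in this answer.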
Please be careful with the size of your writes: in my experience, most NVMe drives require writes of at least 2 dwords (8 bytes), even if you only intend to send 1 dword of data. Just a note.
On to actually helping you use this thing as a host: please see Section 7.6.1 for the initialization sequence. Notice that you must set up multiple registers, read certain parameters, and do other such things.
Assuming you or someone else has done the initialization, let me now answer the core of your question: how to use these queues. The thing is, this answer spans many sections of the spec and is really the core of it, so I am going to break it down as best I can for a simple write command. Please note you CANNOT write until you have first created the queues using the admin queues, which use different opcodes from a different section of the spec; sorry, I cannot write all of that out here.
STEPS TO WRITING DATA TO AN NVMe DRIVE.
In the creation of the submission queue you specify the size of that specific queue: the number of commands that can be in the queue at one time for processing. Along with this you specify the queue base address. For this example, let's assume you set the base address to 0x1000_0000 and the size to 16 (0x10). Figure 105 tells us that every submission queue entry is 64 bytes (0x40), so queue entry 0 is at 0x1000_0000, entry 1 at 0x1000_0040, entry 2 at 0x1000_0080, and so on for our 16 entries; then it wraps around.
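A minimal sketch of that indexing (plain C; the names and the modulo wrap are mine, the 64-byte entry size is from Figure 105):

```c
#include <assert.h>
#include <stdint.h>

#define SQ_ENTRY_SIZE 64u  /* bytes per submission queue entry (Figure 105) */

/* Address of submission queue entry `index` for a queue of `qsize` entries
 * based at `base`; wraps around at the end of the queue. Illustrative only. */
static uint64_t sq_entry_addr(uint64_t base, unsigned index, unsigned qsize) {
    return base + (uint64_t)(index % qsize) * SQ_ENTRY_SIZE;
}
```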
You first store the data to be written. Let's say you were given 512 bytes (0x200) of data to write; for simplicity, you place that data at 0x2000_0000 - 0x2000_0200.
You create the submission queue command. This is not a simple process, and I'm not going to document all of it for you, but you should be referencing Figure 104, Figure 346, and Section 6.15. That is not enough, however: you also need to understand PRP vs. SGL and which you are using (PRP is easier to start with), and NLB (Number of Logical Blocks), which determines your write size. With NVMe you do not specify writes in bytes but in logical blocks, whose size is set by the controller (drive); it may implement multiple block sizes, but that is up to the drive, not you as the host; you just pick from what it supports. See Section 5.15.2.1, Figure 245: the Identify Namespace structure tells you the LBA (logical block address) size. That leads you down a bit of a rabbit hole to determine the actual size, but the info is there.
OK, so you finished this mess and have created the submission command. Let's assume the host has already submitted 2 commands on this queue (at the start this will be 0; I'm picking 2 just to make the example clearer). You now need to place this command at 0x1000_0080.
Now let's assume this is queue 1 (in the equations you posted, the queue number is the y value; note that queue 0 is the admin queue). You now poke the controller's submission queue tail doorbell with the new tail value, i.e. the index of the next free entry (which is why you can queue up multiple commands and only tell the drive when you are ready). Having just placed one command at entry 2, the new tail is 3, so you write the value 3 to register 0x1008.
At this point the drive goes, "Aha, the host has told me there are new commands to fetch." The controller goes to queue base address + command size * 2 and fetches 64 bytes of data, i.e. 1 command (address 0x1000_0080). The controller decodes this command as a write, which means it must read data from some address and put it where it was told to. So your write command tells the drive to go to address 0x2000_0000 and read 512 bytes of data, and it will, if you scope the PCIe bus. The drive then fills out a completion queue entry (16 bytes, specified in Section 4.6) and places it at the completion queue address you specified at queue creation (plus 0x20, since this entry lands at completion queue index 2). Then the controller generates an MSI-X interrupt.
At this point you go to wherever the completion queue is and read the response to check status; if you queued multiple submissions, also check the SQID to see what finished, since jobs can finish out of order. You then write to the completion queue head doorbell (0x100C) to indicate that you have retrieved the completion entry (success or failure). Notice that you never interact with the submission queue head (that's up to the controller, since only it knows when a submission queue entry has been processed), and only the controller places entries at the completion queue tail, since only it can create new entries.
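A toy model of those two doorbell pokes (plain C; a real host maps BAR0 and performs volatile 32-bit MMIO writes, here a byte array stands in for the register file, and CAP.DSTRD is assumed to be 0):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint8_t bar0[0x2000];  /* stand-in for the memory-mapped BAR0 */

/* Write the new tail index to SQyTDBL (offset 1000h + 2y*4 with DSTRD=0). */
static void ring_sq_tail(unsigned y, uint32_t new_tail) {
    memcpy(&bar0[0x1000 + 2 * y * 4], &new_tail, sizeof new_tail);
}

/* Write the new head index to CQyHDBL (offset 1000h + (2y+1)*4). */
static void ring_cq_head(unsigned y, uint32_t new_head) {
    memcpy(&bar0[0x1000 + (2 * y + 1) * 4], &new_head, sizeof new_head);
}
```

For the example above, ring_sq_tail(1, tail) is the poke at 0x1008 and ring_cq_head(1, head) is the acknowledgement at 0x100C.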
I'm sorry this is so long and not well formatted, but hopefully you now have a slightly better understanding of NVMe; it's a bit of a mess at first, but once you get it, it all makes sense. Just remember: my example assumed you had created a queue, which at baseline doesn't exist. First you need to set up the admin submission and completion queues (0x28 and 0x30), which have queue ID 0, so their tail/head doorbells are at 0x1000 and 0x1004 respectively. You then need Section 5 for the opcodes to make stuff happen, but I have faith you can figure it out from what I've given you. If you have any more questions, leave a comment and I'll see what I can do.

Using random() function on multiple threads

I'm working on an app where I need reproducible random numbers. I use srandom() with a seed to initialize the random number sequence. Then I use random() to generate the random numbers from this seed. If this is the only thread generating random numbers, everything works fine. However, if there are multiple threads generating random numbers, they interfere with each other.
Apparently, the sequence of random numbers is not thread safe. There must be a central random number generator that is called by all threads.
My app generates hundreds of objects, each of which has four sequences of 14 random numbers generated this way. Each of these 4 sequences has its own non-random seed, so the random numbers should be reproducible. The problem is that, because of the thread interference I just described, the sequence of 14 numbers being generated is sometimes interrupted by a random number request from another thread.
After thinking about this for a while, I've decided to call
dispatch_sync(dispatch_get_main_queue(), ^{//generate the 14 numbers});
to get each sequence. This should force them to get generated in the proper sequence. In reading the documentation, it says there could be a deadlock if dispatch_sync is called on the queue it's running in. How can I tell if I'm already on the main queue? If I am, I don't need to dispatch anything, right?
Is there a better way to do this?
I suspect another way to do this is similar to this but using a dedicated queue instead of the main queue. I've never tried making my own queue before. Also, the method that needs to call the queue is an ephemeral one, so I'd need to somehow pass the custom queue around if I'm going to go that route. How does one pass a queue as an argument?
For now, I'm running with my idea, above, dispatching synchronously to the main queue, and the app seems to work fine. Worst case scenario, this snippet of code would be run about 4800 times (4 for each of 1200 objects, which is currently the max.).
I assume you want computationally random numbers, rather than cryptographic random numbers.
My suggestion would be to have a separate RNG for each thread, with each thread's RNG seeded centrally from a master RNG. Since the system RNG is not thread safe, create your own small RNG method (a good LCG should work) for use exclusively within one thread.
Use the built-in random() to produce only the initial seeds for each of your threads. Setting the overall initial seed with srandom() ensures that the thread-local my_random() methods all get consistent initial seeds, as long as the threads are started in the same order each time.
Effectively you are building a hierarchy of RNGs to match your hierarchy of threads.
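A minimal sketch of that hierarchy in C (the LCG constants are the common Numerical Recipes pair, an assumption; the my_random mentioned above would correspond to thread_rng_next here):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Per-thread RNG: each thread owns one of these, seeded once from the
 * central random()/srandom() stream, and never touches shared state again. */
typedef struct { uint32_t state; } thread_rng;

static void thread_rng_seed(thread_rng *r, uint32_t seed) {
    r->state = seed;
}

static uint32_t thread_rng_next(thread_rng *r) {
    /* 32-bit LCG step (Numerical Recipes constants, an assumption). */
    r->state = r->state * 1664525u + 1013904223u;
    return r->state;
}

/* Seed a batch of per-thread RNGs from the master stream. Reproducible as
 * long as srandom() gets the same seed and the seeding order is fixed. */
static void seed_all(thread_rng *rngs, int n, unsigned master_seed) {
    srandom(master_seed);
    for (int i = 0; i < n; i++)
        thread_rng_seed(&rngs[i], (uint32_t)random());
}
```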
Another option would be to have a singleton do the computation. The object needing the set of random numbers would ask the singleton for them in a batch.

Parse a list 3 threads at a time; when 5 works are completed, signal the server to do something

Hi, I am curious whether anyone knows a tutorial example where semaphores are used for more than one process/thread. I'm looking to solve the following problem. I have an array of elements and x threads. These threads work over the array, only 3 at a time. After 5 works have been completed, the server is signalled and it cleans those 5 nodes. But I'm having problems designing this. (A node contains a worker value with the 'name' of the thread allowed to work on it, namely nrNodes % nrThreads.)
To make changes to the list, a mutex is necessary so as not to overwrite / make false evaluations.
But I have no clue how to limit the parsing to 3 threads at a given time, or how to signal the main thread for the cleaning session. I have been thinking about using a semaphore and a global counter: when the counter reaches 5, the server (which would probably be another thread) is signalled.
Sorry for the lack of code, but this is a conceptual question; what I have written so far doesn't affect the question in any way.
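A conceptual sketch of one way to wire this up (C with POSIX threads; all names and counts are illustrative): a counting semaphore initialized to 3 gates the workers, and a second semaphore wakes the server after every 5 completed works.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

enum { NWORKS = 10, NCONCURRENT = 3, BATCH = 5 };

static sem_t slots;        /* admits at most NCONCURRENT workers at once */
static sem_t batch_ready;  /* posted once per BATCH completed works */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int completed = 0;
static int batches_cleaned = 0;

static void *worker(void *arg) {
    (void)arg;
    sem_wait(&slots);                  /* wait for one of the 3 slots */
    /* ... work on this thread's share of the array here ... */
    pthread_mutex_lock(&lock);
    if (++completed % BATCH == 0)
        sem_post(&batch_ready);        /* every 5th work wakes the server */
    pthread_mutex_unlock(&lock);
    sem_post(&slots);
    return NULL;
}

static void *server(void *arg) {
    (void)arg;
    for (int i = 0; i < NWORKS / BATCH; i++) {
        sem_wait(&batch_ready);
        pthread_mutex_lock(&lock);
        batches_cleaned++;             /* ... clean the 5 finished nodes ... */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int run_demo(void) {
    pthread_t srv, w[NWORKS];
    sem_init(&slots, 0, NCONCURRENT);
    sem_init(&batch_ready, 0, 0);
    pthread_create(&srv, NULL, server, NULL);
    for (int i = 0; i < NWORKS; i++)
        pthread_create(&w[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKS; i++)
        pthread_join(w[i], NULL);
    pthread_join(srv, NULL);
    return batches_cleaned;
}
```

The mutex protects the shared counter, exactly as the question suggests; the semaphores handle both the "3 at a time" limit and the "signal after 5" notification.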

The memory consistency model CUDA 4.0 and global memory?

Update: the while() condition below gets optimized out by the compiler, so both threads just skip the condition and enter the C.S., even with the -O0 flag. Does anyone know why the compiler is doing this? By the way, declaring the global variables volatile causes the program to hang for some odd reason...
I read the CUDA programming guide but I'm still a bit unclear on how CUDA handles memory consistency with respect to global memory. (This is different from the memory hierarchy) Basically, I am running tests trying to break sequential consistency. The algorithm I am using is Peterson's algorithm for mutual exclusion between two threads inside the kernel function:
flag[threadIdx.x] = 1; // both these are global
turn = 1-threadIdx.x;
while(flag[1-threadIdx.x] == 1 && turn == (1- threadIdx.x));
shared_gloabl_variable_x ++;
flag[threadIdx.x] = 0;
This is fairly straightforward. Each thread asks for the critical section by setting its flag to one, and is polite by giving the turn to the other thread. At the evaluation of the while(), if the other thread has not set its flag, the requesting thread can enter the critical section safely. A subtle problem with this approach is that if the compiler re-orders the writes, the write to turn may execute before the write to flag. If this happens, both threads end up in the C.S. at the same time. This is fairly easy to prove with normal Pthreads, since most processors don't implement sequential consistency. But what about GPUs?
Both of these threads will be in the same warp. And they will execute their statements in lock-step mode. But when they reach the turn variable they are writing to the same variable so the intra-warp execution becomes serialized (doesn't matter what the order is). Now at this point, does the thread that wins proceed onto the while condition, or does it wait for the other thread to finish its write, so that both can then evaluate the while() at the same time? The paths again will diverge at the while(), because only one of them will win while the other waits.
After running the code, I am getting it to consistently break SC. The value I read is ALWAYS 1, which means that both threads somehow are entering the C.S. every single time. How is this possible (GPUs execute instructions in order)? (Note: I have compiled it with -O0, so no compiler optimization, and hence no use of volatile).
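For comparison, here is a CPU-side sketch of the same algorithm using C11 seq_cst atomics (my code, not the asker's): seq_cst forbids exactly the flag/turn store reordering discussed above, so mutual exclusion holds and the counter comes out exact; weaken the ordering and the failure mode described in the question can reappear.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

enum { ITERS = 100000 };

static atomic_int flag_[2];
static atomic_int turn_;
static long counter;                     /* protected by Peterson's algorithm */

static void *run(void *arg) {
    int me = (int)(long)arg, other = 1 - me;
    for (int i = 0; i < ITERS; i++) {
        atomic_store(&flag_[me], 1);     /* announce interest */
        atomic_store(&turn_, other);     /* yield the turn */
        while (atomic_load(&flag_[other]) && atomic_load(&turn_) == other)
            ;                            /* spin until safe to enter */
        counter++;                       /* critical section */
        atomic_store(&flag_[me], 0);     /* leave */
    }
    return NULL;
}

long peterson_demo(void) {
    counter = 0;
    atomic_store(&flag_[0], 0);
    atomic_store(&flag_[1], 0);
    pthread_t a, b;
    pthread_create(&a, NULL, run, (void *)0);
    pthread_create(&b, NULL, run, (void *)1);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;
}
```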
Edit: since you have only two threads and 1-threadIdx.x works, then you must be using thread IDs 0 and 1. Threads 0 and 1 will always be part of the same warp on all current NVIDIA GPUs. Warps execute instructions SIMD fashion, with a thread execution mask for divergent conditions. Your while loop is a divergent condition.
When turn and flags are not volatile, the compiler probably reorders the instructions and you see the behavior of both threads entering the C.S.
When turn and flags are volatile, you see a hang. The reason is that one of the threads will succeed at writing turn, so turn will be either 0 or 1. Let's assume turn==0: If the hardware chooses to execute thread 0's part of the divergent branch, then all is OK. But if it chooses to execute thread 1's part of the divergent branch, then it will spin on the while loop and thread 0 will never get its turn, hence the hang.
You can probably avoid the hang by ensuring that your two threads are in different warps, but I think that the warps must be concurrently resident on the SM so that instructions can issue from both and progress can be made. (Might work with concurrent warps on different SMs, since this is global memory; but that might require __threadfence() and not just __threadfence_block().)
In general, this is a great example of why code like this is unsafe on GPUs and should not be used. I realize, though, that this is just an investigative experiment. In general, CUDA GPUs do not implement sequential consistency (and, as you mention, most processors do not).
Original Answer
The variables turn and flag need to be volatile; otherwise the load of flag will not be repeated and the condition turn == 1-threadIdx.x will not be re-evaluated, but will instead be taken as true.
There should be a __threadfence_block() between the store to flag and store to turn to get the right ordering.
There should be a __threadfence_block() before the shared variable increment (which should also be declared volatile). You may also want a __syncthreads() or at least __threadfence_block() after the increment to ensure it is visible to other threads.
I have a hunch that even after making these fixes you may still run into trouble, though. Let us know how it goes.
BTW, you have a syntax error in this line, so it's clear this isn't exactly your real code:
while(flag[1-threadIdx.x] == 1 and turn==[1- threadIdx.x]);
In the absence of extra memory barriers such as __threadfence(), sequential consistency of global memory is enforced only within a given thread.
