Where are NVMe commands located inside the PCIe BAR? - driver

According to the NVMe specification, the BAR has tail and head fields for each queue. For example:
Submission Queue y Tail Doorbell (SQyTDBL):
Start: 1000h + (2y * (4 << CAP.DSTRD))
End: 1003h + (2y * (4 << CAP.DSTRD))
Submission Queue y Head Doorbell (SQyHDBL):
Start: 1000h + ((2y + 1) * (4 << CAP.DSTRD))
End: 1003h + ((2y + 1) * (4 << CAP.DSTRD))
Is this the queue itself, or just pointers? Is this correct? If it is the queue, I would assume DSTRD indicates the maximum length of all queues.
Moreover, the specification talks about two optional regions: Host Memory Buffer (HMB) and Controller Memory Buffer (CMB).
HMB: a region within the host's DRAM (PCIe root)
CMB: a region within the NVMe controller's DRAM (inside the SSD)
If both are optional, where are the queues located then? Since an endpoint PCIe device only exposes BARs and PCI headers, I don't see any other place they might live, other than a BAR.

Sorry, but I am doing this from memory; I have implemented an FPGA NVMe host, so hopefully my memory is enough to answer your questions and more. If I get something wrong, at least you know why. I'll be providing reference sections from the specification, which you can find here: https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf As a note before I really answer your question, I want to clear up some confusion: understanding the spec takes some time, and I honestly recommend reading it bottom to top. The last few sections give context for the first few, as strange as that sounds.
These registers are the submission and completion queue doorbells, not the queues themselves: specifically the submission queue tail and the completion queue head (Section 3.1). More on this later; I just wanted to correct the misconception that you, as the host, access the submission queue head. You do not; only the controller (traditionally the drive) does. A simple reminder: a submission is you asking the drive to do something, a completion is the drive telling you how it went. Read Section 7.2 for more info.
Before you can send anything to these queues, you must first set the queues up. At reset they do not exist; you must use the admin queue to create them. The admin queue base addresses are set through two controller registers:
28h-2Fh: ASQ (Admin Submission Queue Base Address)
30h-37h: ACQ (Admin Completion Queue Base Address)
Your statement about DSTRD is a big misunderstanding. This field lives in the capabilities register CAP at offset 0x0 (Section 3.1.1). It is the controller (drive) telling you the "doorbell stride": how many bytes sit between consecutive doorbell registers. It says nothing about queue length. I've never seen a drive report anything but 0 for this value since, well, why would you want dead space between doorbell registers?
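To make the offset math concrete, here is a minimal sketch of the doorbell address calculation from the formulas you quoted; bar0 and the function names are my own placeholders, not anything from the spec:

#include <stdint.h>

/* SQyTDBL: 0x1000 + (2y)     * (4 << CAP.DSTRD)
 * CQyHDBL: 0x1000 + (2y + 1) * (4 << CAP.DSTRD) */
static volatile uint32_t *sq_tail_doorbell(volatile uint8_t *bar0,
                                           unsigned qid, unsigned dstrd)
{
    return (volatile uint32_t *)(bar0 + 0x1000 + (2 * qid) * (4u << dstrd));
}

static volatile uint32_t *cq_head_doorbell(volatile uint8_t *bar0,
                                           unsigned qid, unsigned dstrd)
{
    return (volatile uint32_t *)(bar0 + 0x1000 + (2 * qid + 1) * (4u << dstrd));
}

With DSTRD = 0 this gives 0x1008 and 0x100C for queue 1, the numbers used later in this answer.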
Please be careful with the size of your writes: in my experience most NVMe drives require writes of at least 2 dwords (8 bytes) even if you only intend to send 1 dword of data. Just a note.
On to actually helping you use this thing as a host: please reference Section 7.6.1 to find the initialization sequence. Notice how you must set up multiple registers, read certain parameters, and other such things.
Assuming you or someone else has done the initialization, let me now answer the core of your question: how to use these queues. The thing is, this answer spans MANY sections of the spec and is the core of it, so I am going to break it down as best I can for a simple write command. Please note you CANNOT write until you have first created the queues using the admin queues, which use different opcodes from a different section of the spec; sorry, I cannot write all of that out.
STEPS TO WRITING DATA TO AN NVMe DRIVE.
In the creation of the submission queue you specify the size of that specific queue: the number of commands that can sit in the queue at one time for processing. Along with this you specify the queue base address. For this example, let's assume you set the base address to 0x1000_0000 and the size to 16 (0x10). Figure 105 lets us know that every submission queue entry is 64 bytes (0x40), so queue entry 0 is at 0x1000_0000, entry 1 at 0x1000_0040, entry 2 at 0x1000_0080, and so on for our 16 entries; then it wraps back around (see the sketch below).
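As a sketch, with the made-up base address and queue size above, the slot arithmetic is just:

#include <stdint.h>

/* Figure 105: each submission queue entry is 64 bytes; the slot index
 * wraps around at the queue size. */
static uint64_t sq_entry_addr(uint64_t base, unsigned slot, unsigned qsize)
{
    return base + (uint64_t)(slot % qsize) * 64;
}
/* sq_entry_addr(0x10000000, 2, 16) == 0x10000080 */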
You first stage the data to be written. Let's say you were given 512 bytes (0x200) of data to write; for simplicity you place that data at 0x2000_0000 through 0x2000_01FF.
You create the submission queue command. This is not a simple process, and I'm not going to document all of it for you, but understand that you should be referencing Figure 104, Figure 346, and Section 6.15. That is not enough, however. You will also need to understand PRP vs SGL and decide which you are using (PRP is easier to start with). You also need NLB (Number of Logical Blocks), which determines your write size: with NVMe you do not specify writes in bytes but in logical blocks, whose size is set by the controller (drive). It may implement multiple LBA sizes, but that is up to the drive, not you as the host; you just pick from what it supports (Section 5.15.2.1, Figure 245). You want to look at Identify Namespace to tell you the LBA (logical block address) size; this will lead you down a rabbit hole to determine the actual size, but that's OK, the info is there. A hedged sketch of the 64-byte entry layout follows.
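Here is one possible C view of the 64-byte entry (Figure 104/105) filled out for a write (opcode 0x01, Section 6.15) in PRP mode. The struct and field names are mine and the values come from this example, so treat it as a sketch rather than a reference:

#include <stdint.h>

struct nvme_sqe {
    uint32_t cdw0;        /* bits 7:0 opcode, 15:14 PSDT, 31:16 command ID */
    uint32_t nsid;        /* namespace ID */
    uint32_t cdw2, cdw3;  /* not used by the write command */
    uint64_t mptr;        /* metadata pointer, 0 if unused */
    uint64_t prp1;        /* PRP entry 1: 0x2000_0000 in this example */
    uint64_t prp2;        /* PRP entry 2, only needed past the first page */
    uint32_t cdw10;       /* starting LBA, low 32 bits */
    uint32_t cdw11;       /* starting LBA, high 32 bits */
    uint32_t cdw12;       /* bits 15:0 NLB, 0-based: 0 here means 1 block */
    uint32_t cdw13, cdw14, cdw15;
};
/* sizeof(struct nvme_sqe) == 64, matching Figure 105. */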
OK, so you finished this mess and created the submission command. Let's assume the host has already submitted (and the drive completed) 2 commands on this queue; at start this will be 0, I'm picking 2 to make the example clearer. What you now need to do is place this command at 0x1000_0080, i.e. entry 2.
Now let's assume this is queue 1 (in the equations you posted, the queue number is the y value; note that queue 0 is the admin queue). What you need to do is poke the controller's submission queue tail doorbell with the new tail: the index one past the last entry you filled (thus you can queue several commands up at once and only tell the drive when you are ready). In this case the new tail is 3, so you write the value 3 to register 0x1008.
At this point the drive goes: aha, the host has told me there are new commands to fetch. So the controller goes to queue base address + command size * 2 (its current head) and fetches 64 bytes of data, aka 1 command, from address 0x1000_0080. The controller decodes this command as a write, which means it must fetch the data from the host address it was given and commit it to media. So your write command tells the drive to go to address 0x2000_0000 and read 512 bytes of data, and it will, if you scope the PCIe bus. At this point the drive fills out a completion queue entry (16 bytes, specified in Section 4.6) and places it in the completion queue you specified at queue creation (at offset 0x20, since the first two completions already occupy slots 0 and 1). Then the controller generates an MSI-X interrupt.
At this point you must go to wherever the completion queue lives and read the response to check status; if you queued multiple submissions, check the CID (and the SQID, if you use several queues) to see what finished, since jobs can finish out of order. You then must write the completion queue head doorbell (0x100C) to indicate that you have retrieved the completion entry (success or failure). Notice that you never touch the submission queue head (that's up to the controller, since only it knows when a submission queue entry has been processed), and only the controller advances the completion queue tail, since only it can create new entries. A sketch of the 16-byte completion entry follows.
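Again as a sketch with my own names, the 16-byte completion entry from Section 4.6 looks roughly like this; the phase tag is what lets you poll the queue instead of (or in addition to) taking the MSI-X interrupt:

#include <stdint.h>

struct nvme_cqe {
    uint32_t dw0;      /* command specific result */
    uint32_t dw1;      /* reserved */
    uint16_t sq_head;  /* how far the controller's SQ head has advanced */
    uint16_t sqid;     /* which submission queue the command came from */
    uint16_t cid;      /* which command on that queue finished */
    uint16_t status;   /* bit 0 phase tag, bits 15:1 status field */
};
/* sizeof(struct nvme_cqe) == 16. After consuming entries, write the index
 * of the next entry you expect to the CQ head doorbell (0x100C for queue 1). */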
I'm sorry this is so long and not well formatted, but hopefully you now have a slightly better understanding of NVMe. It's a bit of a mess at first, but once you get it, it all makes sense. Just remember that my example assumed you had created a queue, which at baseline doesn't exist. First you need to set up the admin submission and completion queues (0x28 and 0x30), which have queue ID 0, so their tail/head doorbells are at 0x1000 and 0x1004 respectively. You then must reference Section 5 to find the opcodes to make stuff happen, but I have faith you can figure it out from what I've given you. If you have any more questions, put a comment down and I'll see what I can do.

Related

FreeRTOS - The reason for the stack increase?

I have created several tasks whose body is the same function. Inside the function there is a delay, the same for every task. When this delay is large enough, the stack of each task is filled with 6 words less than when the delay is smaller.
As far as I understand, a task's stack grows when several tasks are in the Ready state at the same time.
1. In this situation, is some additional 6-word context written to the task stack?
2. In my example it comes out to 6 words (24 bytes); can this value change?
3. What else can increase stack usage, for example jumping into an interrupt handler?
int main(void){
    xTaskCreate(Task_PrintCountString, "Task_1", mySTACK_SIZE, &param1, 1, NULL);
    xTaskCreate(Task_PrintCountString, "Task_2", mySTACK_SIZE, &param2, 1, NULL);
    xTaskCreate(Task_PrintCountString, "Task_3", mySTACK_SIZE, &param3, 1, NULL);
    vTaskStartScheduler();
}
#define myDELAY 100
void Task_PrintCountString(void *pParams){
    uint16_t c = 0;
    for(;;){
        if(xSemaphoreTake(WriteCountMutex, portMAX_DELAY) == pdTRUE){
            PrintCountString(*(uint8_t *)pParams, c++);
            xSemaphoreGive(WriteCountMutex);
        }
        vTaskDelay(myDELAY/portTICK_PERIOD_MS); // When myDELAY=1, the task stack usage is 6 words more than when myDELAY=100!
    }
}
Address 0x20000158 is the top of the stack for one of the tasks; the others are similar.
The only function, PrintCountString, always has the same call depth. Yet the stack can grow to its maximum value in several stages, over dozens of iterations of the task loop! It seems that not only the context is saved to the stack, but something else as well?
P.S.
I use ARM CM0 port and heap_1.
I made an observation:
The PrintCountString function used to contain a while() loop waiting on an SPI flag. I switched to DMA transfers and removed the while() loop. The stacks of both tasks now fill immediately to their maximum value, regardless of the delays.
FreeRTOS is just C code so, with one exception, the stack usage is determined by the compiler, including the selected optimisation level. The one exception is that when a task is not running, its context is saved to its stack. The context size is fixed in all cases that matter; how large it is depends on which FreeRTOS port you are using (there are more than 40).
There are a few things in your post I'm not sure about.
when this delay is large enough, the stack of each task is filled with
6 words less than when the delay is less
As above, the C code doesn't change depending on how long a delay you have - the compiler knows nothing about FreeRTOS or the concept of a delay. You may see different stack depths depending on where the code was in the call tree when an interrupt occurs (including interrupts that cause context switches).
I look at the top of the stack and count the remaining number of bytes
to be 0xA5
You don't say which port you are using, so I don't know if the stack grows up or down. In any case, the stack is filled with 0xA5 when the task is created and never touched by the kernel again, so the number of 0xA5 bytes left shows the maximum amount of stack the task has used since it was created. That value is returned by calling uxTaskGetStackHighWaterMark() (a small usage sketch follows).
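For example, a minimal sketch (my own task name, not from the question; requires INCLUDE_uxTaskGetStackHighWaterMark set to 1 in FreeRTOSConfig.h) of a task reporting its own high-water mark; the value is in words, and a falling value means the task has been deeper into its stack:

#include "FreeRTOS.h"
#include "task.h"

void vWatermarkTask(void *pvParameters)
{
    for (;;) {
        /* ... the task's real work goes here ... */

        /* NULL means "this task"; the result is the minimum free stack,
         * in words, ever observed since the task was created. */
        UBaseType_t uxFree = uxTaskGetStackHighWaterMark(NULL);
        (void)uxFree; /* inspect in a debugger, or log it over a UART */

        vTaskDelay(pdMS_TO_TICKS(1000));
    }
}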
In general, stack memory is used by the microcontroller itself. Several instructions write/read data to/from the stack:
The PUSH instruction copies the content of one register to the stack and modifies the stack pointer register
The POP instruction reads a word from the stack into a register and modifies the stack pointer register
Both instructions are managed by the compiler. When a function is entered, the compiler emits PUSH instructions to "free up" some registers to work with. At the end of the function, the original register state is restored with POP instructions to guarantee the upcoming control flow. Likewise, each function call along the way can store several register values on the stack.
The amount of used stack memory depends on the call hierarchy.
In your example, if xSemaphoreTake() is called and finds a released semaphore, the call hierarchy is different from the situation where the semaphore is blocked. This produces different stack usage.

Why can't a load bypass a value written by another thread on the same core from a write buffer?

If a CPU core uses a write buffer, then a load can bypass the most recent store to the referenced location from the write buffer, without waiting for the store to appear in the cache. But, as written in A Primer on Memory Consistency and Coherence, if the CPU honors the TSO memory model, then
... multithreading introduces a subtle write buffer issue for TSO. TSO
write buffers are logically private to each thread context (virtual
core). Thus, on a multithreaded core, one thread context should never
bypass from the write buffer of another thread context. This logical
separation can be implemented with per-thread-context write buffers
or, more commonly, by using a shared write buffer with entries tagged
by thread-context identifiers that permit bypassing only when tags
match.
I can't grasp the necessity of this limitation. Could you please give me an example when allowing some thread to bypass a write buffer entry written by another thread on the same core leads to the violation of the TSO memory model?
The classic example of how TSO differs from sequential consistency (SC) is:
(This is example 2.4 here - http://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf)
thread 0 | thread 1
---------------------------------
write 1-->[x] | write 1-->[y]
a = read [x] | b = read [y]
c = read [y] | d = read [x]
Both addresses initially hold 0. The question is: would c=d=0 be a valid outcome? We know a and b must forward the stores before them, since they match the addresses of the local stores, and will probably be forwarded from the local thread's store buffer. However, c and d may not be forwarded across contexts, so they may still show the old value.
The interesting gotcha here is that since each thread observes both stores and forwards the local one, an outcome of a=1,c=0 would mean that t0 saw the store to [x] occur first, while an outcome of b=1,d=0 would mean that t1 saw the store to [y] occur first. Both at once is possible due to store buffer forwarding, and that breaks sequential consistency, which requires all contexts to agree on a single global order of stores. Instead, x86 settled for a weaker TSO model that allows this case.
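For the record, here is a hedged C11 sketch of that litmus test. On x86 the relaxed atomics compile to plain loads and stores, so with enough repetitions (a single run like this one will only rarely show it) you can observe the TSO-legal, SC-impossible outcome c=d=0:

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static atomic_int x, y;        /* both start at 0 */
static int a, b, c, d;

static int t0(void *arg) {
    atomic_store_explicit(&x, 1, memory_order_relaxed); /* write 1 -> [x] */
    a = atomic_load_explicit(&x, memory_order_relaxed); /* forwarded locally */
    c = atomic_load_explicit(&y, memory_order_relaxed); /* may still read 0 */
    return 0;
}

static int t1(void *arg) {
    atomic_store_explicit(&y, 1, memory_order_relaxed); /* write 1 -> [y] */
    b = atomic_load_explicit(&y, memory_order_relaxed); /* forwarded locally */
    d = atomic_load_explicit(&x, memory_order_relaxed); /* may still read 0 */
    return 0;
}

int main(void) {
    thrd_t th0, th1;
    thrd_create(&th0, t0, NULL);
    thrd_create(&th1, t1, NULL);
    thrd_join(th0, NULL);
    thrd_join(th1, NULL);
    /* a=1 and b=1 always; c=0 && d=0 is allowed under TSO but not SC. */
    printf("a=%d b=%d c=%d d=%d\n", a, b, c, d);
    return 0;
}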
Forwarding stores globally is practically impossible, since buffered stores are not necessarily committed; they may even be on the wrong path of a branch misprediction. Forwarding locally is fine, since a flush would also eliminate all the loads that forwarded from those stores, but across contexts you don't have that.
I've also seen work that tries to buffer stores globally outside of the core, but this is not very practical due to latency and bandwidth. For further reading, here's a recent paper that may be relevant - http://ieeexplore.ieee.org/abstract/document/7783736/

how to detect XMIT FIFO is full on a UART 16550 or higher

I have already read a lot of specs and code about the UART, but I cannot find any software-visible way to tell whether the transmit FIFO is full. There is an interrupt when the FIFO is empty; then I can write at least N characters, where N is the FIFO size. But by the time I have written these N characters, some of them have already been sent, so I can in fact write more than N characters; there is just no FIFO-full interrupt. The specs say that when the FIFO is full, the TXRDY pin on the chip is inverted. Is there a way to find this out by software? The Line Status Register bit only says that the FIFO is not empty, which does not mean it is full...
Can anyone help? I want to write characters until the FIFO is full...
It looks to me like they neglected this too, but most people get by with the chip as it is. The usual way to use it is to take an interrupt, fill the FIFO (normally very fast compared to the serial data rate) and then return.
There is a situation where what you are asking for could be nice: transmitting in polled mode. Say you want to send 10 bytes and your polling shows the FIFO is not empty; you have no way to know whether you can send them all or not. Either you wait there until it is empty, which rather defeats the purpose of the FIFO, or you continue polling other stuff until you get back to checking for FIFO-empty, and maybe that slows your overall transmission rate. I guess it is not a very usual way to operate, so nobody worries about it.
The 16550D datasheet says the following:
The transmitter holding register interrupt (02) occurs when the XMIT
FIFO is empty; it is cleared as soon as the transmitter holding
register is written to (1 to 16 characters may be written to the XMIT
FIFO while servicing this interrupt) or the IIR is read.
This means that when the Line Status Register (port base + 5) indicates the Transmitter Empty condition (bit 5), the transmit FIFO is completely empty and you may write up to 16 bytes to the transmitter holding register (port base + 0). It is important not to write more than 16 bytes between occurrences of the transmitter-empty bit being set.
If you don't need to write 16 bytes at the point where you receive the IRQ (or see the transmitter-empty bit set, if polling), you can either keep track of how many bytes you have written since the last transmitter-empty state, or just defer writing further bytes until the next transmitter-empty state. A small polled-mode sketch follows.
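As a sketch of that bookkeeping, assuming x86 port I/O, a base address of 0x3F8 (typical for COM1) and the 16-byte FIFO of a 16550; the port helpers are the usual inline-assembly idiom, not part of any particular library:

#include <stddef.h>
#include <stdint.h>

#define UART_BASE 0x3F8            /* assumption: COM1 */
#define UART_THR  (UART_BASE + 0)  /* transmitter holding register */
#define UART_LSR  (UART_BASE + 5)  /* line status register */
#define LSR_THRE  0x20             /* bit 5: transmit FIFO empty */
#define FIFO_SIZE 16

static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}

static inline uint8_t inb(uint16_t port)
{
    uint8_t v;
    __asm__ volatile ("inb %1, %0" : "=a"(v) : "Nd"(port));
    return v;
}

static void uart_send(const uint8_t *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        while (!(inb(UART_LSR) & LSR_THRE))
            ;                              /* wait until the FIFO is empty */
        for (int i = 0; i < FIFO_SIZE && sent < len; i++)
            outb(UART_THR, buf[sent++]);   /* at most 16 bytes per refill */
    }
}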

Missing master heartbeat does not cause node to react in a CANopen system

I have a strange finding about the heartbeat protocol in CANopen. Maybe somebody else has seen something like this, and maybe it is supposed to work this way... Anyway, here's what it's about:
In CANopen there are two timeout-based life-guarding mechanisms: the first is node guarding, which I will not mention further, since it's considered old news.
The other one is called heartbeat. It is pretty simple: Any participant on the network sends a regular message stating its node ID and its state. The frequency is defined by object 0x1017sub0 and is called heartbeat-producer-time. If it is set to zero, no heartbeat is being sent.
Any other participant can then define a number of nodes it wants to find on the network plus the maximum time there may be between two consecutive heartbeat-messages. This information is stored in object 0x1016sub1..n as 32-bit entries for as many nodes as this particular node wants to listen to.
The entries consist of the node ID (bits 22 to 16) and the mentioned maximum time that may elapse between heartbeats, called the heartbeat-consumer-time (bits 15..0). Again, if an entry is zero, it is ignored.
As you may have gathered, there is no distinction between network-master (node ID 1) and slaves (node IDs 2 to 127).
So far the theory, now for my problem:
I configure one of the slave nodes in my network as a heartbeat consumer for the master, so there's an entry in object 0x1016sub1 that looks like this: 0x000107D0 (decoded in the sketch below). It means that a heartbeat message from the master is expected at least every two seconds.
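Just to make the encoding explicit, here is a tiny sketch decoding that entry according to the bit layout above:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t entry = 0x000107D0;              /* object 0x1016 sub1 */
    unsigned node_id = (entry >> 16) & 0x7F;  /* bits 22..16 */
    unsigned time_ms = entry & 0xFFFF;        /* bits 15..0, milliseconds */
    printf("node %u, consumer time %u ms\n", node_id, time_ms);
    /* prints: node 1, consumer time 2000 ms */
    return 0;
}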
I have observed that this works in two setups: if I send a master heartbeat for a while and then stop, the node either returns to pre-operational mode or sends an appropriate emergency message.
If I never send any master heartbeat at all, I would expect that after I start the node (send it into operational mode) it takes at most two seconds for the node to return to pre-operational mode, send an appropriate emergency message, or perhaps even both. But in the two setups I tried, nothing happened: if I never send any heartbeat, the node never expects one and just keeps on running.
The two setups are very different from each other; I am not sure whether they use the same CANopen stack library.
Is there an explanation?
If you read CANopen User Manual, section 1.3.1.6, page 39, you will notice that the heartbeat consumer is first activated upon receiving a heartbeat from the producer. I would assume then that, since in your example the first heartbeat is never sent, the consumer is not activated.

Parse a list 3 threads at a time; when 5 works are completed, signal the server to do something

Hi, I am curious whether anyone knows of a tutorial or example where semaphores are used for more than one process/thread. I am trying to solve the following problem: I have an array of elements and x threads. These threads work over the array, only 3 at a time. After 5 pieces of work have been completed, the server is signalled and it cleans those 5 nodes. (A node contains a worker value holding the 'name' of the thread that is allowed to work on it, namely nrNodes % nrThreads.) I'm having trouble designing this.
In order to make changes to the list, a mutex is necessary so as not to overwrite entries or make false evaluations.
But I have no clue how to limit the parsing of the list to 3 threads at a given moment, nor how to signal the main thread for the cleaning session. I have been thinking about using a semaphore and a global counter: when the counter reaches 5, the server (which would probably be another thread) gets signalled. One possible shape for this is sketched below.
Sorry for the lack of code, but this is a conceptual question; what I have written so far doesn't affect the question in any way.
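One possible shape for this design, as a hedged sketch with made-up names rather than a definitive answer: a counting semaphore initialised to 3 caps how many threads work on the array at once, and a mutex-protected counter plus a condition variable signals the "server" after every 5 completions:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define MAX_PARALLEL 3   /* at most 3 threads touch the array at once */
#define BATCH        5   /* signal the server after every 5 completions */

static sem_t slots;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  batch_done = PTHREAD_COND_INITIALIZER;
static int completed;

static void *worker(void *arg)
{
    sem_wait(&slots);              /* only 3 workers pass this point */
    /* ... work on this thread's nodes of the array ... */
    sem_post(&slots);

    pthread_mutex_lock(&lock);
    completed++;
    if (completed % BATCH == 0)
        pthread_cond_signal(&batch_done);  /* wake the cleaning thread */
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *server(void *arg)
{
    pthread_mutex_lock(&lock);
    while (completed < BATCH)      /* wait for the first batch of 5 */
        pthread_cond_wait(&batch_done, &lock);
    pthread_mutex_unlock(&lock);
    puts("5 works completed, cleaning those nodes");
    return NULL;
}

int main(void)
{
    sem_init(&slots, 0, MAX_PARALLEL);
    /* create the worker threads and the server thread here, then join */
    return 0;
}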

Resources