Write to different sections of a file from multiple threads - ruby-on-rails

In Ruby, if I open a file once and have multiple threads, where each thread writes to a distinct range of the file, do I still need any kind of lock to make this thread-safe?
For example:
thread 1 - writes bytes 0-99
thread 2 - writes bytes 100-199
...

Related

Where are NVMe commands located inside the PCIe BAR?

According to the NVMe specification, the BAR has tail and head fields for each queue. For example:
Submission Queue y Tail Doorbell (SQyTDBL):
Start: 1000h + (2y * (4 << CAP.DSTRD))
End: 1003h + (2y * (4 << CAP.DSTRD))
Submission Queue y Head Doorbell (SQyHDBL):
Start: 1000h + ((2y + 1) * (4 << CAP.DSTRD))
End: 1003h + ((2y + 1) * (4 << CAP.DSTRD))
Are these the queues themselves, or just pointers to them? Is this correct? If these are the queues, I would assume DSTRD indicates the maximum length of all queues.
Moreover, the specification talks about two optional regions: Host Memory Buffer (HMB) and Controller Memory Buffer (CMB).
HMB: a region within the host's DRAM (PCIe root)
CMB: a region within the NVMe controller's DRAM (inside the SSD)
If both are optional, where are the queues located then? Since an endpoint PCIe device only works with BARs and PCI headers, I don't see any other place they might be located, other than a BAR.
Sorry, but I am doing this from memory. I have implemented an FPGA NVMe host, though, so hopefully my memory is enough to answer your questions and more; if I get something wrong, at least you know why. I'll be providing reference sections from the specification, which you can find here: https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf. Also, a note before I really answer your question, to clarify some confusion: understanding the spec takes some time, and I honestly recommend reading it bottom to top; the last few sections give context for the first few, as strange as that sounds.
These are the doorbells for the submission and completion queues, specifically the submission queue tail and the completion queue head respectively (Section 3.1). More on this later; I just wanted to correct the misconception that you, as the host, access the submission queue head. You do not; only the controller (traditionally the drive) does. A simple reminder: a submission is you asking the drive to do something; a completion is the drive telling you how it went. Read Section 7.2 for more info.
Before you can send anything to these queues, you must first set them up. At baseline these queues do not exist in the system; you must use the admin queue to create them:
28h-2Fh ASQ Admin Submission Queue Base Address
30h-37h ACQ Admin Completion Queue Base Address
Your statement about DSTRD is a big misunderstanding. This field lives in the capabilities register (offset 0x0, Section 3.1.1). It is the controller (drive) telling you the "doorbell stride": how many bytes sit between consecutive doorbell registers. I've never seen a drive report anything but 0 for this value, since why would you want dead space between doorbell registers?
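To make the stride arithmetic concrete, here is a small sketch in Go of the offset formulas you quoted (the helper name is mine; per the correction above, the second register of each pair is the completion queue head doorbell, CQyHDBL):

package main

import "fmt"

// doorbellOffsets: given a queue number y and the CAP.DSTRD value read
// from the capabilities register, return the BAR byte offsets of that
// queue's two doorbell registers.
func doorbellOffsets(y, dstrd uint64) (sqTail, cqHead uint64) {
	stride := uint64(4) << dstrd     // bytes between consecutive doorbells
	sqTail = 0x1000 + (2*y)*stride   // SQyTDBL: submission queue y tail
	cqHead = 0x1000 + (2*y+1)*stride // CQyHDBL: completion queue y head
	return sqTail, cqHead
}

func main() {
	sq, cq := doorbellOffsets(1, 0) // queue 1, DSTRD = 0
	fmt.Printf("SQ1TDBL=0x%X CQ1HDBL=0x%X\n", sq, cq) // 0x1008, 0x100C
}

With DSTRD = 0 and y = 1 this yields 0x1008 and 0x100C, the two addresses used in the walkthrough below.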
Please be careful with the size of your writes: in my experience most NVMe drives require writes of at least 2 dwords (8 bytes) even if you only intend to send 1 dword of data. Just a note.
Onto actually helping you use this thing as a host: please reference Section 7.6.1 to find the initialization sequence. Notice that you must set up multiple registers, read certain parameters, and do other such things.
Assuming you or someone else has done initialization, let me now answer the core of your question: how to use these queues. The thing is, this answer spans many sections of the spec and is the core of it, so I am going to break it down as best I can for a simple write command. Please note you CANNOT write until you have first created the queues using the admin queues, which use different opcodes from a different section of the spec; sorry, I cannot write all of that out.
STEPS TO WRITING DATA TO AN NVMe DRIVE.
In the creation of the submission queue you specify the size of that specific queue: the number of commands that can be placed in the queue at one time for processing. Along with this you specify the queue base address. For this example, let's assume you set the base address to 0x1000_0000 and the size to 16 (0x10). Figure 105 lets us know that every submission queue entry is 64 bytes (0x40), so entry 0 is at 0x1000_0000, entry 1 at 0x1000_0040, entry 2 at 0x1000_0080, and so on for our 16 entries; then it wraps around.
You first stage the data to be written. Let's say you were given 512 bytes (0x200) of data to write; for simplicity, you place that data at 0x2000_0000-0x2000_0200.
You create the submission queue command. This is not a simple process, and I'm not going to document all of it for you, but you should be referencing Figure 104, Figure 346, and Section 6.15. That is not enough, however. You also need to understand PRP vs. SGL and decide which you are using (PRP is easier to start with), and NLB (number of logical blocks), which determines your write size: with NVMe you do not specify writes in bytes but in logical blocks, whose size is set by the controller (drive). It may implement multiple LBA sizes, but that is up to the drive, not you as the host; you just pick from what it supports (Section 5.15.2.1, Figure 245). Look at Identify Namespace to find the LBA (logical block address) size; this leads you down a rabbit hole to determine the actual size, but that's OK, the info is there.
OK, so you finished this mess and have created the submission command. Let's assume 2 commands have already been completed on this queue (at the start this will be 0; I'm picking 2 just to make the example clearer). What you now need to do is place this command at 0x1000_0080.
Now let's assume this is queue 1 (in the equations you posted, the queue number is the y value; note that queue 0 is the admin queue). You now poke the controller's submission queue tail doorbell with the new tail index, i.e. the slot just past your last loaded entry (this is how you can queue up several commands at once and only tell the drive when you are ready). Your new command sits in slot 2, so the new tail is 3, and you write the value 3 to register 0x1008 (see the sketch after these steps).
At this point the drive will go: aha, the host has told me there are new commands to fetch. The controller will go to the queue base address + command size * 2 and fetch 64 bytes of data, i.e. one command (address 0x1000_0080). The controller will decode this command as a write, which means the controller (drive) must read data from some host address and store it where it was told to. This means your write command should tell the drive to go to address 0x2000_0000 and read 512 bytes of data, and it will, if you scope the PCIe bus. At this point the drive will fill out a completion queue entry (16 bytes, specified in Section 4.6) and place it in the completion queue you specified at queue creation (at base + 0x20, since this is the entry at index 2). Then the controller will generate an MSI-X interrupt.
At this point you must go to wherever the completion queue was placed and read the response to check the status and, if you queued multiple submissions, check the SQID to see what finished, since jobs can finish out of order. You then must write the completion queue head doorbell (0x100C) to indicate that you have consumed the completion entry (success or failure). Notice that you never interact with the submission queue head (that's up to the controller, since only it knows when a submission entry has been processed), and only the controller advances the completion queue tail, since only it can create new entries.
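Here is the slot and doorbell arithmetic from the steps above as a Go sketch (the names are mine; the constants mirror the example numbers):

package main

import "fmt"

// Example values from the walkthrough above.
const (
	sqBase    uint64 = 0x1000_0000 // submission queue base address
	entrySize uint64 = 64          // bytes per submission entry (Figure 105)
	qSize     uint64 = 16          // entries in this queue
)

// slotAddress is a hypothetical helper returning where to place the
// submission entry for a given tail index, wrapping around the ring.
func slotAddress(tail uint64) uint64 {
	return sqBase + entrySize*(tail%qSize)
}

func main() {
	fmt.Printf("entry at 0x%X\n", slotAddress(2)) // 0x1000_0080
	// After filling slot 2 you write the new tail, 3, to the queue's
	// tail doorbell (0x1008 for queue 1, with DSTRD = 0).
}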
I'm sorry this is so long and not well formatted, but hopefully you now have a slightly better understanding of NVMe; it's a bit of a mess at first, but once you get it, it all makes sense. Just remember that my example assumed you had already created a queue, which at baseline doesn't exist. First you need to set up the admin submission and completion queues (registers 0x28 and 0x30), which have queue ID 0, so their tail/head doorbells are at 0x1000 and 0x1004 respectively. Then reference Section 5 to find the opcodes to make things happen, but I have faith you can figure it out from what I've given you. If you have any more questions, leave a comment and I'll see what I can do.

Concurrently parsing records in a binary file in Go

I have a binary file that I want to parse. The file is broken up into records that are 1024 bytes each. The high level steps needed are:
Read 1024 bytes at a time from the file.
Parse each 1024-byte "record" (chunk) and place the parsed data into a map or struct.
Return the parsed data to the user and any error(s).
I'm not looking for code, just design/approach help.
Due to I/O constraints, I don't think it makes sense to attempt concurrent reads from the file. However, I see no reason why the 1024-byte records can't be parsed using goroutines so that multiple 1024-byte records are being parsed concurrently. I'm new to Go, so I wanted to see if this makes sense or if there is a better (faster) way:
A main function opens the file and reads 1024 bytes at a time into byte arrays (records).
The records are passed to a function that parses the data into a map or struct. The parser function would be called as a goroutine on each record.
The parsed maps/structs are appended to a slice via a channel. I would preallocate the slice's underlying array to the file size (in bytes) divided by 1024, since that should be the exact number of elements (assuming no errors).
I'd have to make sure I don't run out of memory as well, as the file can be anywhere from a few hundred MB up to 256 TB (rare, but possible). Does this make sense, or am I thinking about the problem incorrectly? Will this be slower than simply parsing the file linearly as I read it 1024 bytes at a time, or will parsing the records concurrently perform better?
Cross-posted on Software Engineering
This is an instance of the producer-consumer problem: the producer is your main function, which generates 1024-byte records, and the consumers parse these records and send the results to a channel so they are added to the final slice. There are a few questions tagged producer-consumer and go; they should get you started. As for what is fastest in your case, that depends on so many things that it is really not possible to answer. The best solution may be anywhere from a completely sequential implementation to a cluster of servers in which the records are moved around by RabbitMQ or something similar.
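A minimal sketch of that producer-consumer pipeline, assuming a hypothetical Record type and parseRecord function standing in for your format:

package main

import (
	"io"
	"os"
	"sync"
)

const recordSize = 1024

// Record and parseRecord are placeholders for your actual format.
type Record struct{}

func parseRecord(buf []byte) (Record, error) { return Record{}, nil }

func parseFile(path string, workers int) ([]Record, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	jobs := make(chan []byte, workers) // bounded, so memory use stays capped
	results := make(chan Record, workers)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() { // consumer: parses records concurrently
			defer wg.Done()
			for buf := range jobs {
				if rec, err := parseRecord(buf); err == nil {
					results <- rec // error handling elided for brevity
				}
			}
		}()
	}
	go func() { wg.Wait(); close(results) }()

	go func() { // producer: a single sequential reader
		defer close(jobs)
		for {
			buf := make([]byte, recordSize)
			if _, err := io.ReadFull(f, buf); err != nil {
				return // io.EOF, io.ErrUnexpectedEOF, or a real error
			}
			jobs <- buf
		}
	}()

	var out []Record
	for rec := range results {
		out = append(out, rec)
	}
	return out, nil
}

func main() {
	recs, err := parseFile("data.bin", 4)
	if err != nil {
		panic(err)
	}
	_ = recs
}

Note that results arrive in nondeterministic order; if the original record order matters, carry the record's index through the channels and write into a preallocated slice instead of appending.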

NSCondition: what if there is no lock when calling signal?

From Apple's documentation about NSCondition, the usage of NSCondition should be:
Thread 1:
[cocoaCondition lock];
while (timeToDoWork <= 0)
[cocoaCondition wait];
timeToDoWork--;
// Do real work here.
[cocoaCondition unlock];
Thread 2:
[cocoaCondition lock];
timeToDoWork++;
[cocoaCondition signal];
[cocoaCondition unlock];
And in the documentation of the signal method of NSCondition:
You use this method to wake up one thread that is waiting on the condition. You may call this method multiple times to wake up multiple threads. If no threads are waiting on the condition, this method does nothing. To avoid race conditions, you should invoke this method only while the receiver is locked.
My question is:
I don't want Thread 2 to be blocked in any situation, so I removed the lock and unlock calls in Thread 2. That is, Thread 2 can queue as much work as it wishes, and Thread 1 will do the work one piece at a time; if there is no more work, it waits (blocks). This is still a producer-consumer pattern, but the producer is never blocked.
But this is not correct according to Apple's documentation. So what could possibly go wrong with this pattern? Thanks.
Failing to lock is a serious problem when multiple threads are accessing shared data. In the example from Apple's code, if Thread 2 doesn't lock the condition object, then it can be incrementing timeToDoWork at the same time that Thread 1 is decrementing it. That can result in one of those operations being lost. For example:
Thread 1 reads the current value of timeToDoWork, gets 1
Thread 2 reads the current value of timeToDoWork, gets 1
Thread 2 computes the incremented value (timeToDoWork + 1), gets 2
Thread 1 computes the decremented value (timeToDoWork - 1), gets 0
Thread 2 writes the new value of timeToDoWork, stores 2
Thread 1 writes the new value of timeToDoWork, stores 0
timeToDoWork started at 1 and was both incremented and decremented, so it should end at 1, but it actually ends at 0. By rearranging the steps, it could end up at 2 instead. Presumably, the value of timeToDoWork represents something real and important; getting it wrong would probably screw up the program.
If your two threads are doing something as simple as incrementing and decrementing a number, they can do it without locks by using the atomic operation functions, such as OSAtomicIncrement32Barrier() and OSAtomicDecrement32Barrier(). However, if the shared data is any more complicated than that (and it probably is in any non-trivial case), then they really need to use a synchronization mechanism such as a condition lock.
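The same lost update, and the atomic fix, can be sketched in Go (sync/atomic here plays the role of the OSAtomic barrier functions; the counter name comes from the example above):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var timeToDoWork atomic.Int32
	timeToDoWork.Store(1) // starts at 1, as in the interleaving above

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); timeToDoWork.Add(1) }()  // producer increments
	go func() { defer wg.Done(); timeToDoWork.Add(-1) }() // consumer decrements
	wg.Wait()

	// Each Add is one indivisible read-modify-write, so neither update
	// can be lost: this always prints 1.
	fmt.Println(timeToDoWork.Load())
}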

Mark Generation: What is VM: Dispatch continuations

What does the allocation under "VM: Dispatch continuations" signify?
(http://i.stack.imgur.com/4kuqz.png)
@InkGolem is along the right lines: this is a cache for dispatch blocks inside GCD.
@AbhiBeckert is off by a factor of 1000: 16 MB is 2 million 64-bit pointers, not 2 billion.
This cache is allocated on a per-thread basis, and you're just seeing the allocated size of the cache, not what's actually in use. 16 MB is well within range if you're doing lots of dispatching onto background threads (and since you're using RAC, I'm guessing that you are).
Don't worry about it, basically.
From what I understand, continuations are a style of function-pointer passing that tells the process what to execute next; in your case I'm assuming those would be dispatch blocks from GCD. I'm assuming the VM has a bunch of these that it reuses over time, and that's what you're seeing in Instruments. Then again, I'm not an expert on threading and I could be totally off in left field.

Copy to shared memory in CUDA

In CUDA programming, if we want to use shared memory, we need to bring data from global memory into shared memory, and threads are used to transfer that data.
I read somewhere (in online resources) that it is better not to involve all the threads in the block in copying data from global memory to shared memory. The idea makes sense insofar as not all threads execute together: threads within a warp execute together, but warps are not guaranteed to execute sequentially. Say a block of 96 threads is divided into 3 warps: warp 0 (threads 0-31), warp 1 (threads 32-63), warp 2 (threads 64-95). It is not guaranteed that warp 0 will be executed first (am I right?).
So which threads should I use to copy the data from global to shared memory?
To use a single warp to load a shared memory array, just do something like this:
__global__
void kernel(float *in_data)
{
    __shared__ float buffer[1024];
    // Only the first warp (threads 0..warpSize-1) performs the load.
    if (threadIdx.x < warpSize) {
        for (int i = threadIdx.x; i < 1024; i += warpSize) {
            buffer[i] = in_data[i];
        }
    }
    // Every thread in the block waits here until the load is complete.
    __syncthreads();
    // rest of kernel follows
}
[disclaimer: written in browser, never tested, use at own risk]
The key point here is the use of __syncthreads() to ensure that all threads in the block wait until the warp performing the load to shared memory has finished. The code I posted uses the first warp, but you can calculate a warp number by dividing the thread index within the block by warpSize. I also assumed a one-dimensional block; it is trivial to compute the thread index in a 2D or 3D block, so I leave that as an exercise for the reader.
Once a block is assigned to a multiprocessor, it resides there until all threads within it have finished, and during this time the warp scheduler switches among the warps that have ready operands. So if there is one block on the multiprocessor with three warps, and just one warp is fetching data from global to shared memory while the other two stay idle, probably waiting on the __syncthreads() barrier, you lose nothing: you are limited only by the latency of global memory, which you would have been anyway. As soon as the fetch finishes, the warps can go ahead with their work.
Therefore, no guarantee that warp 0 executes first is needed, and you can use any threads. The only two things to keep in mind are to make global memory accesses as coalesced as possible and to avoid bank conflicts.
