pthread scheduling and read() - pthreads

We are trying to stream raw video from a RAID array.
With our RAID array it is faster to loop and read 8k at a time than it is to read 8M at once.
I am trying to modify someone else's multi-threaded C program on openSUSE 11.4 Linux from (pseudocode)
read(fd, buf, 8M)
to
for(i = 0; i < 1000; ++i)
    read(fd, buf, 8K);
But the looping version is reading much slower.
My suspicion is that the read thread is being swapped out every time it calls read(). I cannot be the first person to have hit this problem, but I've not found any example code for this. What is the best way to prevent the read thread being swapped out? Change the scheduler/priority? Change thread concurrency?
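For reference, a minimal C sketch of the looping variant. Note that the pseudocode above reads every chunk into the same buf, and 1000 reads of 8K cover 8000K, slightly less than 8M (8192K). The sketch assumes each chunk should land at the next offset of one large buffer and that short reads must be handled; read_chunked is a hypothetical helper name, not something from the original program. It does not settle whether scheduling is the cause of the slowdown.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read `total` bytes from fd into buf, issuing requests of at most `chunk`
 * bytes each. Returns the number of bytes read (short only at end of file)
 * or -1 on error. */
ssize_t read_chunked(int fd, void *buf, size_t total, size_t chunk)
{
    char *dst = buf;
    size_t done = 0;

    while (done < total) {
        size_t want = total - done;
        if (want > chunk)
            want = chunk;

        ssize_t n = read(fd, dst + done, want);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)
            break;              /* end of file */
        done += (size_t)n;      /* advance so later chunks do not overwrite */
    }
    return (ssize_t)done;
}
```

Called as read_chunked(fd, buf, 8 * 1024 * 1024, 8 * 1024), this issues 1024 requests of 8K each.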

Related

OpenCV CUDA API very slow at the first call

I am using cuda::resize to resize a vector of images (stored as GpuMat).
The first call takes ~15ms, while the rest take only ~0.3ms. Is there a way to shorten the time of the first call?
Here is my code (simplified):
for (int i = 0; i < num_images; ++i)
{
    full_img = v_GpuMat[i].clone(); // v_GpuMat is a vector of images in cuda::GpuMat
    seam_scale = 0.4377;
    cuda::resize(full_img, img, Size(), seam_scale, seam_scale, INTER_LINEAR);
}
Thank you very much.
CUDA device memory allocations and copies between device and host are very slow. Try to allocate memory and load data outside the main loop. Cloning a matrix allocates new device memory each time; copying the data into an existing matrix instead of cloning it should speed up your code.
After checking the result in the NVIDIA Visual Profiler, I found that it is cudaLaunchKernel that takes ~20ms, and only on the first call.
If you have to run a continuous process like I do, one solution is a dry run before you process your own tasks. For this sample, making one cuda::resize call outside the loop before it starts makes the calls inside the loop much faster.

OpenCL slow memory access in for loop

I have a program that I built in OpenCL, in which each kernel accesses a read-only buffer located in global memory. At some point each kernel needs to copy some data from global memory into a temporary buffer. I made a for loop to copy a region of global memory byte-by-byte into the temporary buffer. I execute the aforementioned kernel using the clEnqueueNDRangeKernel command which is located inside a while loop. In order to measure how fast the clEnqueueNDRangeKernel command is, I added a counter called ups (Updates Per Second) which is incremented at the end of each while loop. Every one second I print the value of the counter and set it to zero.
I noticed that my program was running slowly, at about 53 ups. After some investigation I found out that the problem was the memory copying loop that was described above. This is the code:
typedef uchar byte;
byte tempBuffer[128];
byte* destPtr = (byte*)&tempBuffer[0];
__global const byte* srcPtr = (__global const byte*)globalMemPtr;
for(size_t k = 0; k < regionSize; ++k)
{
    destPtr[k] = srcPtr[k];
}
The variable globalMemPtr is a pointer to the region of global memory that needs to be copied into the temporary buffer, and tempBuffer is the temporary buffer. The variable regionSize holds the size of the region to be copied, in bytes. In this case its value is 12.
What I noticed was that if I replace regionSize with 12, the kernel runs much faster, at about 90 ups. My assumption is that the OpenCL compiler can optimize the for loop to copy memory much faster when 12 is used, but it can't when regionSize is used.
Does anyone know what is happening? Can anyone help me?
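To make the contrast concrete, here is a plain C sketch of the two loop shapes being compared. It assumes the asker's hypothesis (a compile-time trip count lets the compiler unroll or merge the copies) rather than confirming it for any particular OpenCL compiler. The names REGION_SIZE, copy_region_fixed and copy_region_runtime are hypothetical. In OpenCL, a constant like this can be supplied at build time through the options string of clBuildProgram, for example "-D REGION_SIZE=12".

```c
#include <stddef.h>

#ifndef REGION_SIZE
#define REGION_SIZE 12   /* assumed fixed region size, taken from the question */
#endif

typedef unsigned char byte;

/* Trip count is a compile-time constant: the compiler is free to unroll
 * the loop completely or merge the byte copies into wider moves. */
void copy_region_fixed(byte *dest, const byte *src)
{
    for (size_t k = 0; k < REGION_SIZE; ++k)
        dest[k] = src[k];
}

/* Trip count is only known at run time: the loop counter and branch
 * have to stay, whatever value is passed in. */
void copy_region_runtime(byte *dest, const byte *src, size_t region_size)
{
    for (size_t k = 0; k < region_size; ++k)
        dest[k] = src[k];
}
```

Both functions copy the same bytes when region_size equals REGION_SIZE; only the information available to the compiler differs.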

What's the reason of using Circular Buffer in iOS Audio Calling APP?

My question is pretty much self-explanatory. Sorry if it seems too dumb.
I am writing an iOS VoIP dialer and have checked some open-source code (iOS audio calling apps). Almost all of them use a circular buffer for storing recorded and received PCM audio data. So I am wondering why we need a circular buffer in this case. What is the exact reason for using such an audio buffer?
Thanks in advance.
Using a circular buffer lets you process the input and output data asynchronously from its source. The audio render process takes place on a high priority thread. It asks for audio samples from your app (playback), and offers audio (recording/processing) on a timer in the form of callbacks.
A typical scenario would be for the audio callback to fire every 0.023 seconds to ask for (and/or offer) 1024 samples of audio (1024 samples at 44.1 kHz is about 0.023 seconds). This thread is synchronized with the system hardware, so it is imperative that your callback returns before the 0.023 seconds is up. If it doesn't, the hardware won't wait for you; it will just skip that cycle and you will have an audible pop or silence, or miss audio you are trying to record.
A circular buffer's place is to pass data between threads. In an audio application that would be moving the samples to and from the audio thread asynchronously. One thread produces samples on to the "head" of the buffer, and the other thread consumes them from the "tail".
Here's an example, retrieving audio samples from the microphone and writing them to disk. Your app has subscribed to a callback that fires every 0.023 seconds, offering 1024 samples to be recorded. The naive approach would be to simply write the audio to disk from within that callback.
void myCallback(float *samples, int sampleCount, SampleSaver *saver){
    SampleSaverSaveSamples(saver, samples, sampleCount);
}
This will work!! Most of the time...
The problem is that there is no guarantee that writing to disk will finish before 0.023 seconds, so every now and then, your recording has a pop in it because SampleSaver just plain took too long and the hardware just skips the next callback.
The right way to do this is to use a circular buffer. I personally use TPCircularBuffer because it's awesome. The way it works (externally) is that you ask the buffer for a pointer to write data to (the head) on one thread, then on another thread you ask the buffer for a pointer to read from (the tail). Here's how it would be done using TPCircularBuffer (skipping setup and using a simplified callback).
//this is on the high priority thread that can't wait for anything like a slow write to disk
void myCallback(float *samples, int sampleCount, TPCircularBuffer *buffer){
    int32_t availableBytes = 0;
    int32_t neededBytes = sampleCount * sizeof(float);
    float *head = TPCircularBufferHead(buffer, &availableBytes);
    if (head == NULL || availableBytes < neededBytes) return; //buffer full: drop this block rather than overrun
    memcpy(head, samples, neededBytes); //copies samples to head
    TPCircularBufferProduce(buffer, neededBytes); //moves buffer head "forward in the circle"
}
This operation is super quick and puts no extra pressure on that sensitive audio thread. You then create your own timer on a separate thread to write the samples to disk.
//this is on some background thread that can take its sweet time
void myLeisurelySavingCallback(TPCircularBuffer *buffer, SampleSaver *saver){
    int32_t available;
    float *tail = TPCircularBufferTail(buffer, &available);
    int samplesInBuffer = available / sizeof(float); //mono
    SampleSaverSaveSamples(saver, tail, samplesInBuffer);
    TPCircularBufferConsume(buffer, samplesInBuffer * sizeof(float)); // moves tail forward
}
And there you have it, not only do you avoid audio glitches, but if you initialize a big enough buffer, you can set your write to disk callback to only fire every second or two (after the circular buffer has built up a good bit of audio) which is much easier on your system than writing to disk every 0.023 seconds!
The main reason to use the buffer is so the samples can be handled asynchronously. They are a great way to pass messages between threads without locks as well. Here is a good article explaining a neat memory trick for the implementation of a circular buffer.
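To show what the head and tail bookkeeping described above can look like, here is a minimal single-producer, single-consumer ring buffer sketched in C11. This is not TPCircularBuffer's implementation; Ring, ring_write, ring_read and RING_CAPACITY are hypothetical names, and the sketch assumes exactly one writing thread and one reading thread.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define RING_CAPACITY 4096u   /* bytes; must be a power of two */

typedef struct {
    unsigned char data[RING_CAPACITY];
    atomic_size_t head;   /* total bytes ever written; only the producer stores it */
    atomic_size_t tail;   /* total bytes ever read; only the consumer stores it */
} Ring;                   /* a zero-initialized Ring is an empty buffer */

/* Producer side: copy in up to len bytes, return how many were accepted. */
size_t ring_write(Ring *r, const void *src, size_t len)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    size_t space = RING_CAPACITY - (head - tail);
    if (len > space)
        len = space;

    size_t pos = head & (RING_CAPACITY - 1);
    size_t first = RING_CAPACITY - pos;       /* bytes before the wrap point */
    if (first > len)
        first = len;
    memcpy(r->data + pos, src, first);
    memcpy(r->data, (const unsigned char *)src + first, len - first);

    /* Publish the data only after it has been copied. */
    atomic_store_explicit(&r->head, head + len, memory_order_release);
    return len;
}

/* Consumer side: copy out up to len bytes, return how many were available. */
size_t ring_read(Ring *r, void *dst, size_t len)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    size_t avail = head - tail;
    if (len > avail)
        len = avail;

    size_t pos = tail & (RING_CAPACITY - 1);
    size_t first = RING_CAPACITY - pos;
    if (first > len)
        first = len;
    memcpy(dst, r->data + pos, first);
    memcpy((unsigned char *)dst + first, r->data, len - first);

    /* Free the space only after the data has been copied out. */
    atomic_store_explicit(&r->tail, tail + len, memory_order_release);
    return len;
}
```

Neither side takes a lock or waits for the other: the audio callback would call ring_write and return immediately, and the background thread would call ring_read whenever it gets around to it.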
Good question. There is another good reason for using a circular buffer.
On iOS, if you use callbacks (Audio Unit) for recording and playing audio (in fact you need to if you want to build a real-time audio transfer app), the recorder callback hands you a chunk of data covering a specific amount of time (let's say 20 milliseconds). And on iOS the chunk length is not fixed: if you set the callback interval to 20ms you will get 370 or 372 bytes of data, and you never know in advance which. (Correct me if I am wrong.) Then, to transfer the audio in UDP packets, you need a codec to encode and decode the data (G729 is generally used for VoIP apps). But G729 takes data in multiples of 8. Assume you encode 368 (8*46) bytes per 20ms. What are you going to do with the rest of the data? You need to store it, in order, so it can be processed with the next chunk.
So that's the reason. There are some other details, but I kept it simple for your better understanding. Just comment below if you have any questions.
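The carry-over arithmetic can be sketched in C as follows. The 8-byte granularity is taken from the answer above, not from the G.729 specification, and Carry, take_whole_frames and FRAME_BYTES are hypothetical names. In a real app the leftover bytes would simply stay in the circular buffer; a tiny separate carry array is used here only to keep the example short.

```c
#include <stddef.h>
#include <string.h>

#define FRAME_BYTES 8u   /* encoder input granularity assumed from the answer */

typedef struct {
    unsigned char pending[FRAME_BYTES]; /* leftover bytes from earlier callbacks */
    size_t pending_len;                 /* always less than FRAME_BYTES */
} Carry;

/* Join the carried-over bytes with one callback's data, write every whole
 * frame to out, and keep the remainder for the next call. Returns the number
 * of bytes written to out (a multiple of FRAME_BYTES). out must have room
 * for len + FRAME_BYTES - 1 bytes. */
size_t take_whole_frames(Carry *c, const unsigned char *in, size_t len,
                         unsigned char *out)
{
    size_t total = c->pending_len + len;
    size_t usable = total - (total % FRAME_BYTES);
    size_t written = 0;

    if (usable > 0) {
        memcpy(out, c->pending, c->pending_len);   /* old leftovers go first */
        written = c->pending_len;
        size_t from_in = usable - written;
        memcpy(out + written, in, from_in);
        written += from_in;
        in += from_in;
        len -= from_in;
        c->pending_len = 0;
    }

    /* Whatever is left (fewer than FRAME_BYTES bytes) waits for the next chunk. */
    memcpy(c->pending + c->pending_len, in, len);
    c->pending_len += len;
    return written;
}
```

With callbacks delivering 370, 372 and 370 bytes, this yields 368, 368 and 376 bytes for the encoder, carrying 2, then 6, then 0 bytes forward.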

Compute shader: read data written in one thread from another?

Can somebody tell me whether the following compute shader is possible with DirectX 11?
I want the first thread in a Dispatch that accesses an element in a buffer (g_positionsGrid) to set (compare exchange) that element with a temporary value to signify that it is taking some action.
In this case the temp value is 0xffffffff and the first thread is going to continue on and allocate a value from a structured append buffer (g_positions) and assign it to that element.
So all fine so far, but other threads in the dispatch can come in between the compare exchange and the allocation by the first thread, and so need to wait until the allocation index is available. I do this with a busy wait, i.e. the while loop.
Sadly this just locks up the GPU; I'm assuming the value written by the first thread is not propagated to the other threads stuck in the while loop.
Is there any way to get those threads to see that value?
Thanks for any help!
RWStructuredBuffer<float3> g_positions : register(u1);
RWBuffer<uint> g_positionsGrid : register(u2);
void AddPosition( uint address, float3 pos )
{
    uint token = 0;
    // Assign a temp value to signify first thread has accessed this particular element
    InterlockedCompareExchange(g_positionsGrid[address], 0, 0xffffffff, token);
    if(token == 0)
    {
        //If first thread in here allocate index and assign value which
        //hopefully the other threads will pick up
        uint index = g_positions.IncrementCounter();
        g_positionsGrid[address] = index;
        g_positions[index].m_position = pos;
    }
    else
    {
        if(token == 0xffffffff)
        {
            uint index = g_positionsGrid[address];
            //This never meets its condition
            [allow_uav_condition]
            while(index == 0xffffffff)
            {
                //For some reason this thread never gets the assignment
                //from the first thread assigned above
                index = g_positionsGrid[address];
            }
            g_positions[index].m_position = pos;
        }
        else
        {
            //Just assign value as the first thread has already allocated a valid slot
            g_positions[token].m_position = pos;
        }
    }
}
Thread sync in DirectCompute is very easy, but compared to the equivalent features in CPU threading it is very inflexible. AFAIK, the only way to sync data between threads in a compute shader is to use groupshared memory and GroupMemoryBarrierWithGroupSync(). That means that you can:
create small temporary buffer in groupshared memory
calculate value
write to groupshared buffer
synchronize threads with GroupMemoryBarrierWithGroupSync()
read from groupshared from another thread and use it somehow
To implement all of this, you need proper array indices. But where can you get them from? In DirectCompute, the values passed to Dispatch and the system values available in the shader (SV_GroupIndex, SV_DispatchThreadID, SV_GroupThreadID, SV_GroupID) are related. Using those values you can calculate indices to access your buffers.
Compute shaders are not well documented and there is no easy way in, but to find out more you can at least:
read MSDN: Compute shader overview
watch DirectCompute Lecture Series videos on channel9
examine the compute shader samples from the DirectX SDK
very nice samples from NVIDIA's SDK (10 and 11)
read this advanced NVIDIA paper where they implemented thread reduction and then optimized their code to run 10 times faster ;)
As for your code: you can probably redesign it a little.
It is always good to have all threads do the same task (symmetric loading). You cannot really assign different tasks to your threads the way you do in CPU code.
If your data first needs some preprocessing and then further processing, you may want to split the work into different Dispatch() calls (different shaders) that you call in sequence:
preprocessShader reads from buffer inputData and writes to preprocessedData
calculateShader reads from preprocessedData and writes to finalData
In this case you can drop out any slow thread sync and slow group shared memory.
Look at "Thread reduction" trick mentioned above.
Hope it helps! And happy coding!

CUDA: are access times for texture memory similar to coalesced global memory?

My kernel threads access a linear character array in a coalesced fashion. If I map
the array to texture I don't see any speedup. The running times are
almost the same. I'm working on a Tesla C2050 with compute capability 2.0 and read
somewhere that global accesses are cached. Is that true? Perhaps that is why I
am not seeing a difference in the running time.
The array in the main program is
char *dev_database = NULL;
cudaMalloc( (void**) &dev_database, JOBS * FRAGMENTSIZE * sizeof(char) );
and I bind it to texture texture<char> texdatabase with
cudaBindTexture(NULL, texdatabase, dev_database, JOBS * FRAGMENTSIZE * sizeof(char) );
Each thread then reads a character ch = tex1Dfetch(texdatabase, p + id) where id
is threadIdx.x + blockIdx.x * blockDim.x and p is an offset.
I'm binding only once and dev_database is a large array. Actually I found that
if the size is too large the bind fails. Is there a limit on the size of the array
to bind? Thanks very much.
There are several possibilities for why you don't see any difference in performance, but the most likely is that this memory access is not your bottleneck. If it is not your bottleneck, making it faster will have no effect on performance.
Regarding caching: for this case, since you are reading only bytes, each warp of 32 threads will read 32 bytes, which means 4 consecutive warps map to one 128-byte cache line. So assuming few cache conflicts, you will get up to 4x reuse from the cache. So if this memory access is a bottleneck, it is conceivable that the texture cache might not benefit you more than the general-purpose cache.
You should first determine if you are bandwidth bound and if this data access is the culprit. Once you have done that, then optimize your memory accesses. Another tactic to consider is to access 4 to 16 chars per thread per load (using a char4 or int4 struct with byte packing/unpacking) rather than one per thread to increase the number of memory transactions in flight at a time -- this can help to saturate the global memory bus.
There is a good presentation by Paulius Micikevicius from GTC 2010 that you might want to watch. It covers both analysis-driven optimization and the specific concept of memory transactions in flight.
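The packing tactic mentioned above (several chars per load, then unpacking) can be illustrated with a host-side C sketch. It only shows the load-then-unpack pattern; in a CUDA kernel the wide load would be a char4 or int fetch per thread, and whether it helps depends on the access being the bottleneck. count_char_packed is a hypothetical example function, not part of the asker's code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count occurrences of target in data, fetching four chars per load
 * instead of one. */
size_t count_char_packed(const char *data, size_t n, char target)
{
    const uint8_t t = (uint8_t)target;
    size_t count = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        uint32_t word;
        memcpy(&word, data + i, sizeof word);    /* one 4-byte load */
        for (int b = 0; b < 4; ++b)              /* unpack the four bytes */
            count += ((word >> (8 * b)) & 0xFFu) == t;
    }
    for (; i < n; ++i)                           /* leftover 0 to 3 bytes */
        count += (uint8_t)data[i] == t;

    return count;
}
```

Because every byte of each word is examined, the result does not depend on byte order.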
