Trying to keep N CPUs running threads - pthreads

I want to run (say) 25 CPU-intensive tasks on my 6-core computer, using 5 cores at a time (so 1 is left free for other tasks). Each of the 25 tasks can finish at a different time: one task may finish in 20 minutes, while another can take up to 4 hours.
Using pthreads I've managed to build a very simple algorithm that launches 5 threads, waits for their completion, and then launches another 5. The weak point of this algorithm is that it cannot take advantage of the fact that different threads may finish at different times; it always waits for all 5 threads to complete before launching the next set.
Here is the code extract that launches the threads:
#define MAX_REACTOR_THREADS 5

/* ....................... */
/* SOME LENGTHY STUFF HERE */
/* ....................... */

int cases = 25;
pthread_t *threads_REACTOR = calloc(cases, sizeof(pthread_t));
int index = 0;

/* CALL THREADS PROCEDURE FOR COMPUTING FUNCTION: REACTOR_THREAD */
int int_div = cases/MAX_REACTOR_THREADS;
int I, k;
for (I = 0; I < int_div; I++){
    /* LAUNCH THREADS */
    for (k = 0; k < MAX_REACTOR_THREADS; k++){
        index = int_div*k + I;
        pthread_create(&threads_REACTOR[index], NULL,
                       &REACTOR_THREAD, (void*) &inputs_array[index]);
    }
    /* JOIN THREADS */
    for (k = 0; k < MAX_REACTOR_THREADS; k++){
        index = int_div*k + I;
        pthread_join(threads_REACTOR[index], NULL);
    }
    /* HERE PRINT PROGRESS BAR */
    Progress_Bar((I+1)*MAX_REACTOR_THREADS, cases, 40, COLOR_Y);
}
if ((cases % MAX_REACTOR_THREADS) != 0){
    /* LAUNCH REMAINING THREADS */
    for (index = MAX_REACTOR_THREADS*int_div; index < cases; index++){
        pthread_create(&threads_REACTOR[index], NULL,
                       &REACTOR_THREAD, (void*) &inputs_array[index]);
    }
    /* JOIN REMAINING THREADS */
    for (index = MAX_REACTOR_THREADS*int_div; index < cases; index++){
        pthread_join(threads_REACTOR[index], NULL);
        /* HERE PRINT PROGRESS BAR */
        Progress_Bar(index+1, cases, 40, COLOR_Y);
    }
}

/* ....................... */
/* SOME LENGTHY STUFF HERE */
/* ....................... */
How can this be improved so that a new task is launched the moment a thread finishes its job?
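A common pattern is to keep exactly MAX_REACTOR_THREADS workers alive and let each one pull the next unstarted case from a shared counter as soon as it finishes its current one. Below is a minimal sketch of that idea; it assumes cases, inputs_array, and REACTOR_THREAD from the code above are visible to the workers (make them globals or pass them in a struct), and the worker wrapper, queue_lock, and next_case are names introduced here for illustration:

#include <pthread.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_case = 0;                /* index of the next unstarted case */

static void *worker(void *arg)
{
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        int index = next_case++;         /* claim one case atomically */
        pthread_mutex_unlock(&queue_lock);
        if (index >= cases)
            return NULL;                 /* nothing left to do */
        REACTOR_THREAD(&inputs_array[index]);
    }
}

/* In the launching code: start 5 workers, then wait for all of them.
 * Each worker starts a new case the moment it finishes the previous one. */
pthread_t workers[MAX_REACTOR_THREADS];
for (int k = 0; k < MAX_REACTOR_THREADS; k++)
    pthread_create(&workers[k], NULL, worker, NULL);
for (int k = 0; k < MAX_REACTOR_THREADS; k++)
    pthread_join(workers[k], NULL);

The Progress_Bar call can be moved into the worker, just after a case completes, guarded by the same mutex; a counting semaphore or condition variable would also work in place of the shared counter.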

Related

STM32 - Reading I2S to record a .WAV file. Audio choppy, what is causing it?

I'm using an STM32 (STM32F446RE) to receive audio from two INMP441 MEMS microphones in a stereo setup via the I2S protocol and record it to a .WAV file on a micro SD card, using the HAL library.
I wrote the firmware that records audio into a .WAV with FreeRTOS, but the audio files that I record sound like Darth Vader. Here is a screenshot of the audio in Audacity:
If you zoom in, you can see a constant noise being inserted in between the real audio data:
I don't know what is causing this.
I have tried increasing the MessageQueue, but that doesn't seem to be the problem; the queue stays at 0 most of the time. I've also tried different frame sizes and sampling rates, changing the number of channels, and using only one INMP441, all without success.
I'll explain the firmware below.
Here is a block diagram of the architecture for the RTOS that I have implemented:
It consists of three tasks. The first one receives a command via UART (with interrupts) that signals to start or stop recording. The second one is simply a state machine that walks through the steps to write a .WAV.
Here is the code for the WriteWavFileTask:
switch (audio_state)
{
    case STATE_START_RECORDING:
        sprintf(filename, "%saud_%03d.wav", SDPath, count++);
        do
        {
            res = f_open(&file_ptr, filename, FA_CREATE_ALWAYS|FA_WRITE);
        } while (res != FR_OK);
        res = fwrite_wav_header(&file_ptr, I2S_SAMPLE_FREQUENCY, I2S_FRAME, 2);
        HAL_I2S_Receive_DMA(&hi2s2, aud_buf, READ_SIZE);
        audio_state = STATE_RECORDING;
        break;
    case STATE_RECORDING:
        osDelay(50);
        break;
    case STATE_STOP:
        HAL_I2S_DMAStop(&hi2s2);
        while (osMessageQueueGetCount(AudioQueueHandle)) osDelay(1000);
        filesize = f_size(&file_ptr);
        data_len = filesize - 44;
        total_len = filesize - 8;
        f_lseek(&file_ptr, 4);
        f_write(&file_ptr, (uint8_t*)&total_len, 4, bw);
        f_lseek(&file_ptr, 40);
        f_write(&file_ptr, (uint8_t*)&data_len, 4, bw);
        f_close(&file_ptr);
        audio_state = STATE_IDLE;
        break;
    case STATE_IDLE:
        osThreadSuspend(WAVHandle);
        audio_state = STATE_START_RECORDING;
        break;
    default:
        osDelay(50);
        break;
}
Here are the macros used in the code for readability:
#define I2S_DATA_WORD_LENGTH (24) // industry-standard 24-bit I2S
#define I2S_FRAME (32) // bits per sample
#define READ_SIZE (128) // samples to read from I2S
#define WRITE_SIZE (READ_SIZE*I2S_FRAME/16) // half words to write
#define WRITE_SIZE_BYTES (WRITE_SIZE*2) // bytes to write
#define I2S_SAMPLE_FREQUENCY (16000) // sample frequency
The last task is responsible for processing the buffer received via I2S. Here is the code:
void convert_endianness(uint32_t *array, uint16_t Size) {
    for (int i = 0; i < Size; i++) {
        array[i] = __REV(array[i]);
    }
}

void HAL_I2S_RxCpltCallback(I2S_HandleTypeDef *hi2s)
{
    convert_endianness((uint32_t *)aud_buf, READ_SIZE);
    osMessageQueuePut(AudioQueueHandle, aud_buf, 0L, 0);
    HAL_I2S_Receive_DMA(hi2s, aud_buf, READ_SIZE);
}

void pvrWriteAudioTask(void *argument)
{
    /* USER CODE BEGIN pvrWriteAudioTask */
    static UINT *bw;
    static uint16_t aud_ptr[WRITE_SIZE];
    /* Infinite loop */
    for(;;)
    {
        osMessageQueueGet(AudioQueueHandle, aud_ptr, 0L, osWaitForever);
        res = f_write(&file_ptr, aud_ptr, WRITE_SIZE_BYTES, bw);
    }
    /* USER CODE END pvrWriteAudioTask */
}
This task reads from the queue an array of 256 uint16_t elements containing the raw PCM audio data. f_write takes its Size parameter in bytes, so 512 bytes are written to the SD card per block. The I2S receives 128 frames (for a 32-bit frame, 128 words, i.e. 256 half-words).
The following is the configuration for the I2S and clocks:
Any help would be much appreciated!
Solution
As pmacfarlane pointed out, the problem was the method used for buffering the audio data. The solution consists of easing the overhead on the ISR and using circular DMA for double buffering. Here is the code:
#define I2S_DATA_WORD_LENGTH (24) // industry-standard 24-bit I2S
#define I2S_FRAME (32) // bits per sample
#define READ_SIZE (128) // samples to read from I2S
#define BUFFER_SIZE (READ_SIZE*I2S_FRAME/16) // number of uint16_t elements expected
#define WRITE_SIZE_BYTES (BUFFER_SIZE*2) // bytes to write
#define I2S_SAMPLE_FREQUENCY (16000) // sample frequency

uint16_t aud_buf[2*BUFFER_SIZE]; // double buffer
static volatile uint16_t *BufPtr;

void convert_endianness(uint32_t *array, uint16_t Size) {
    for (int i = 0; i < Size; i++) {
        array[i] = __REV(array[i]);
    }
}

void HAL_I2S_RxHalfCpltCallback(I2S_HandleTypeDef *hi2s)
{
    BufPtr = aud_buf; // first half is ready to be saved
    osSemaphoreRelease(RxAudioSemHandle);
}

void HAL_I2S_RxCpltCallback(I2S_HandleTypeDef *hi2s)
{
    BufPtr = &aud_buf[BUFFER_SIZE]; // second half is ready to be saved
    osSemaphoreRelease(RxAudioSemHandle);
}

void pvrWriteAudioTask(void *argument)
{
    /* USER CODE BEGIN pvrWriteAudioTask */
    static UINT bw; // bytes written, filled in by f_write
    /* Infinite loop */
    for(;;)
    {
        osSemaphoreAcquire(RxAudioSemHandle, osWaitForever);
        convert_endianness((uint32_t *)BufPtr, READ_SIZE);
        res = f_write(&file_ptr, (const void *)BufPtr, WRITE_SIZE_BYTES, &bw);
    }
    /* USER CODE END pvrWriteAudioTask */
}
Problems
I think the problem is your method of buffering the audio data - mainly in this function:
void HAL_I2S_RxCpltCallback(I2S_HandleTypeDef *hi2s)
{
    convert_endianness((uint32_t *)aud_buf, READ_SIZE);
    osMessageQueuePut(AudioQueueHandle, aud_buf, 0L, 0);
    HAL_I2S_Receive_DMA(hi2s, aud_buf, READ_SIZE);
}
The main problem is that you are re-using the same buffer each time. You have queued a message to save aud_buf to the SD-card, but you've also instructed the I2S to start DMAing data into that same buffer, before it has been saved. You'll end up saving some kind of mish-mash of "old" data and "new" data.
@Flexz pointed out that the message queue takes a copy of the data, so there is no issue with the I2S overwriting the data that is being written to the SD card. However, taking that copy (in an ISR) adds overhead and delays the start of the new I2S DMA.
Another problem is that you are doing the endian conversion in this function (that is called from an ISR). This will block any other (lower priority) interrupts from being serviced while this happens, which is a bad thing in an embedded system. You should do the endian conversion in the task that reads from the queue. ISRs should be very short and do the minimum possible work (often just setting a flag, giving a semaphore, or adding something to a queue).
Lastly, while you are doing the endian conversion, what is happening to the incoming audio samples? The previous DMA has completed and you haven't started a new one, so they are simply dropped on the floor.
Possible solution
You probably want to allocate a suitably big buffer, and configure your DMA to work in circular buffer mode. This means that once started, the DMA will continue forever (until you stop it), so you'll never drop any samples. There won't be any gap between one DMA finishing and a new one starting, since you never need to start a new one.
The DMA provides a "half-complete" interrupt, to say when it has filled half the buffer. So start the DMA, and when you get the half-complete interrupt, queue up the first half of the buffer to be saved. When you get the fully-complete interrupt, queue up the second half of the buffer to be saved. Rinse and repeat.
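For illustration, starting such a transfer might look like the line below. This is a sketch assuming the I2S DMA stream is configured in circular mode (e.g. in CubeMX) and uses the double buffer from the solution above:

/* One circular transfer covering both halves of the double buffer;
 * the half-complete and full-complete callbacks then fire alternately,
 * and the DMA wraps around to the start of aud_buf by itself. */
HAL_I2S_Receive_DMA(&hi2s2, aud_buf, 2 * READ_SIZE);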
You might want to add some logic to detect whether an interrupt fires before the previous save has completed, since that data will be overwritten and possibly corrupted. Depending on the speed of the SD card (and the sample rate), this may or may not be a problem.
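One possible way to detect that, assuming RxAudioSemHandle is a counting semaphore consumed once per saved half (osSemaphoreGetCount is part of CMSIS-RTOS2; overrun_count is a name introduced here):

volatile uint32_t overrun_count = 0;

void HAL_I2S_RxHalfCpltCallback(I2S_HandleTypeDef *hi2s)
{
    /* If the semaphore is still pending, the writer task has not yet
     * saved the previous half, and that data is about to be overwritten. */
    if (osSemaphoreGetCount(RxAudioSemHandle) > 0)
        overrun_count++;                 /* writer task fell behind */
    BufPtr = aud_buf;
    osSemaphoreRelease(RxAudioSemHandle);
}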

Why are there time bubbles in my GPU timeline even when triple buffering?

I'm having trouble understanding why there are time bubbles on my GPU timeline when inspecting my app using PIX timing captures. Here is a picture of one of the time bubbles I'm talking about, highlighted in orange:
The timeline doesn't look at all how I expected. Since I am triple buffering, I would expect the GPU to be constantly working, without any time gaps between frames, because the CPU can easily feed commands to the GPU before the GPU is done processing them. Instead, the CPU doesn't seem to be 3 frames ahead; it seems to be constantly waiting for the GPU to finish before it starts working on a new frame. That makes me wonder whether my triple-buffering code is broken. Here is my code for moving to the next frame:
void gpu_interface::next_frame()
{
    UINT64 current_frame_fence_value = get_frame_resource()->fence_value;
    UINT64 next_frame_fence_value = current_frame_fence_value + 1;
    check_hr(swapchain->Present(0, 0));
    check_hr(graphics_cmd_queue->Signal(fence.Get(), current_frame_fence_value));
    {
        // CPU and GPU frame-to-frame event.
        PIXEndEvent(graphics_cmd_queue.Get());
        PIXBeginEvent(graphics_cmd_queue.Get(), 0, "fence value: %d", next_frame_fence_value);
    }
    // Check if the next frame is ready to be rendered.
    // The GPU must have reached at least the fence value of the frame we're about to render.
    if (fence->GetCompletedValue() < current_frame_fence_value)
    {
        PIXBeginEvent(0, "CPU Waiting for GPU to reach fence value: %d", current_frame_fence_value);
        // Wait for the next frame resource to be ready
        fence->SetEventOnCompletion(current_frame_fence_value, fence_event);
        WaitForSingleObject(fence_event, INFINITE);
        PIXEndEvent();
    }
    // Next frame is ready to be rendered.
    // Update frame_index. GetCurrentBackBufferIndex() is incremented by swapchain->Present() calls.
    frame_index = swapchain->GetCurrentBackBufferIndex();
    frames[frame_index].fence_value = next_frame_fence_value;
}
Here's the whole timing capture: https://1drv.ms/u/s!AiGFMy6hVmtNgaky52n7QDrQ6o7V1A?e=MFc4xW
EDIT: Fixed answer
void gpu_interface::next_frame()
{
    check_hr(swapchain->Present(0, 0));
    UINT64 current_frame_fence_value = get_frame_resource()->fence_value;
    UINT64 next_frame_fence_value = current_frame_fence_value + 1;
    check_hr(graphics_cmd_queue->Signal(fence.Get(), current_frame_fence_value));
    // Update frame_index. GetCurrentBackBufferIndex() is incremented by swapchain->Present() calls.
    frame_index = swapchain->GetCurrentBackBufferIndex();
    // The GPU must have reached at least the fence value of the frame we're about to render.
    size_t minimum_fence = get_frame_resource()->fence_value;
    size_t completed = fence->GetCompletedValue();
    if (completed < minimum_fence)
    {
        PIXBeginEvent(0, "CPU Waiting for GPU to reach fence value: %d", minimum_fence);
        // Wait for the next frame resource to be ready
        fence->SetEventOnCompletion(minimum_fence, fence_event);
        WaitForSingleObject(fence_event, INFINITE);
        PIXEndEvent();
    }
    frames[frame_index].fence_value = next_frame_fence_value;
    {
        // CPU and GPU frame-to-frame event.
        PIXEndEvent(graphics_cmd_queue.Get());
        PIXBeginEvent(graphics_cmd_queue.Get(), 0, "fence value: %d", next_frame_fence_value);
    }
}
Timing capture of the correct code: https://1drv.ms/u/s!AiGFMy6hVmtNgakzGizTiA_s-FwPqA?e=qIHHTw
You signal the queue with current_frame_fence_value, and right after that you check whether the fence has completed that value:
if (fence->GetCompletedValue() < current_frame_fence_value)
Instead, you need to check the fence value of the frame you are about to render to see whether you can continue, and that is fence_values[frame_index], where frame_index has already been updated. It would go something like this:
void gpu_interface::next_frame()
{
    check_hr(swapchain->Present(0, 0));
    UINT64 current_frame_fence_value = get_frame_resource()->fence_value;
    check_hr(graphics_cmd_queue->Signal(fence.Get(), current_frame_fence_value));
    UINT64 next_frame_fence_value = current_frame_fence_value + 1;
    frame_index = swapchain->GetCurrentBackBufferIndex();
    // current_frame_fence_value is not the fence value of the frame you are about to render;
    // that is fence_values[frame_index]. Note that frame_index is updated before this check.
    if (fence->GetCompletedValue() < fence_values[frame_index])
    {
        // Wait for the next frame resource to be ready
        fence->SetEventOnCompletion(fence_values[frame_index], fence_event);
        WaitForSingleObject(fence_event, INFINITE);
    }
    frames[frame_index].fence_value = next_frame_fence_value;
}
Try writing down fence values for the first few frames to see how that works.
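To make that concrete, here is a hypothetical trace of the corrected code, assuming three frame resources whose fence_value fields all start at 0 and a back-buffer index that cycles 0, 1, 2, 0, ...:

// call 1 (idx 0): signal 0, next = 1; idx -> 1; wait for fence >= 0 (no wait); fence_values[1] = 1
// call 2 (idx 1): signal 1, next = 2; idx -> 2; wait for fence >= 0 (no wait); fence_values[2] = 2
// call 3 (idx 2): signal 2, next = 3; idx -> 0; wait for fence >= 0 (no wait); fence_values[0] = 3
// call 4 (idx 0): signal 3, next = 4; idx -> 1; wait for fence >= 1;           fence_values[1] = 4

From call 4 onward, each call waits only on the value signalled two calls earlier, so the CPU can run up to two frames ahead of the GPU before it blocks, instead of serializing with the GPU every frame.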

Tracking iOS memory usage

I am looking for a way to track iOS memory usage from within my app as accurately as possible. After many trials I ended up with the following code:
uint64_t memory_usage (void) {
    task_vm_info_data_t vmInfo;
    mach_msg_type_number_t count = TASK_VM_INFO_COUNT;
    if (task_info(mach_task_self(), TASK_VM_INFO, (task_info_t) &vmInfo, &count) != KERN_SUCCESS)
        return -1; // wraps to UINT64_MAX, since the return type is unsigned
    return vmInfo.internal;
}
When executed on a real device, the value returned is in sync with the value reported by the Xcode debugger.
The behaviour I cannot explain is this: when allocated memory reaches about 500MB, the value returned by this function stops increasing (it keeps reporting a value in the 450MB range), while the Xcode debugger keeps growing up to 1.3GB, at which point a memory warning is delivered to my app.
The code I used to test this function fires every 0.5 seconds from an NSTimer with the following action:
- (void) checkMemory {
    uint64_t mem_usage = memory_usage();
    // allocate a 5MB buffer to test memory usage
    uint32_t buffer_size = 5 * 1024 * 1024;
    char *buffer = (char *)malloc(buffer_size);
    memset(buffer, 0, buffer_size);
    p[i++] = buffer; // p and i are declared elsewhere; buffers are kept so usage keeps growing
}
Am I missing something?

Detect pulse in Objective-C

I'm creating an iOS app that visualizes the port status of an Arduino. For that, the iPad receives information from the Arduino via a serial cable.
The Arduino sends a packet with its current port status every 100ms, and this status is visualized on the iPad.
The ports are input ports. I've noticed that the device I'm reading pulses the ports, so the Arduino reads alternating high/low levels. That creates flickering in the visualization.
My question is how to detect whether the level is up or the input is just flickering.
The port is high for x seconds, goes low for y seconds, and then repeats. If the port is low for z seconds I need to show it as low in the visualization; otherwise it is high.
- (void) readBytesAvailable:(UInt32)numBytes {
    int bytesRead = [manager read:rxBuffer Length:numBytes];
    if (rxBuffer[i] == 48) {          // ASCII '0'
        [self setButtonRed];
    } else if (rxBuffer[i] == 49) {   // ASCII '1'
        [self setButtonWhite];
    }
}
https://www.dropbox.com/s/bhy5lbm8lkdhnoy/3wire.png?dl=0
If I understood correctly, the scenario is this: you want to determine whether the output is alternating its state or stuck at ground. You didn't specify the period/up-low times or the number of pins, so I'll assume you have four buttons connected to Arduino pins 1, 2, 3, 5, and I'll use literals.
You'll have to set CHECK_PERIOD to a sampling period that samples the input 4-5 times per state, and CHECK_ITERATIONS so that a few missed samples are tolerated.
For instance if the normal wave is 100ms high and 100ms low, I'd set CHECK_PERIOD to 20 and CHECK_ITERATIONS to, let's say, 3 or 4.
#define NUM_INPUTS 4
#define CHECK_PERIOD 20        // ms between samples (values from the example above)
#define CHECK_ITERATIONS 3     // consecutive LOW samples before reporting LOW

unsigned long previousInputCheck;
const int inputPins[] = { 1, 2, 3, 5 };
unsigned char inputCounter[NUM_INPUTS];
unsigned char inputStates[NUM_INPUTS];

/* ... THEN, IN THE MAIN LOOP ... */
if ((millis() - previousInputCheck) >= CHECK_PERIOD)
{
    previousInputCheck += CHECK_PERIOD;
    unsigned char i;
    for (i = 0; i < NUM_INPUTS; i++)
    {
        if (digitalRead(inputPins[i]) == LOW)
        {
            if (inputCounter[i] <= CHECK_ITERATIONS)
                inputCounter[i]++;
            if (inputCounter[i] == CHECK_ITERATIONS)
            {
                inputStates[i] = LOW;
            }
        }
        else
        {   // HIGH
            inputCounter[i] = 0;
            inputStates[i] = HIGH;
        }
    }
}

Varispeed with Libsndfile, Libsamplerate and Portaudio in C

I'm working on an audio visualizer in C with OpenGL, libsamplerate, PortAudio, and libsndfile, and I'm having difficulty using src_process correctly. My goal is to use src_process to achieve vinyl-like varispeed in real time within the visualizer. Right now my implementation changes the pitch of the audio without changing the speed, and it does so with lots of distortion, apparently from missing frames: when I lower the speed with the src_ratio, it almost sounds granular, like chopped-up samples. I keep experimenting with my buffering chunks, but 9 times out of 10 I get a libsamplerate error saying my input and output arrays overlap. I've also been looking at the speed-change example that came with libsamplerate and I can't find where I went wrong. Any help would be appreciated.
Here's the code I believe is relevant. Thanks, and let me know if I can be more specific; this semester was my first experience with C and with programming in general.
#define FRAMES_PER_BUFFER 1024
#define ITEMS_PER_BUFFER (FRAMES_PER_BUFFER * 2)
float src_inBuffer[ITEMS_PER_BUFFER];
float src_outBuffer[ITEMS_PER_BUFFER];
void initialize_SRC_DATA()
{
    data.src_ratio = 1;                              // sets default playback speed
    /*---------------*/
    data.src_data.data_in = data.src_inBuffer;       // point to SRC in buffer
    data.src_data.data_out = data.src_outBuffer;     // point to SRC out buffer
    data.src_data.input_frames = 0;                  // start with zero to force load
    data.src_data.output_frames = ITEMS_PER_BUFFER
                                  / data.sfinfo1.channels; // number of frames to write out
    data.src_data.src_ratio = data.src_ratio;        // sets default playback speed
}
/* Open audio stream */
err = Pa_OpenStream( &g_stream,
                     NULL,
                     &outputParameters,
                     data.sfinfo1.samplerate,
                     FRAMES_PER_BUFFER,
                     paNoFlag,
                     paCallback,
                     &data );
/* Read FramesPerBuffer amount of data from inFile into the SRC input buffer */
numberOfFrames = sf_readf_float(data->inFile, data->src_inBuffer, framesPerBuffer);
/* Loop the inFile if EOF is reached */
if (numberOfFrames < framesPerBuffer)
{
    sf_seek(data->inFile, 0, SEEK_SET);
    numberOfFrames = sf_readf_float(data->inFile,
                                    data->src_inBuffer + (numberOfFrames * data->sfinfo1.channels),
                                    framesPerBuffer - numberOfFrames);
}
/* Inform SRC data how many input frames to process */
data->src_data.end_of_input = 0;
data->src_data.input_frames = numberOfFrames;
/* Perform SRC modulation; processed samples land in src_outBuffer[] */
if ((data->src_error = src_process(data->src_state, &data->src_data))) {
    printf("\nError : %s\n\n", src_strerror(data->src_error));
    exit(1);
}
/* Write processed SRC data to audio out and visual out */
for (i = 0; i < framesPerBuffer * data->sfinfo1.channels; i++)
{
    // gl_audioBuffer[i] = data->src_outBuffer[i] * data->amplitude;
    out[i] = data->src_outBuffer[i] * data->amplitude;
}
I figured out a solution that works well enough for me, and I'll explain it as best I can for anyone else with a similar issue. The way the API works is that you give it a certain number of input frames and it spits out a certain number of output frames. For an SRC ratio of 0.5, if you want 512 output frames per loop you have to feed in 512/0.5 = 1024 frames; when src_process runs, it compresses those 1024 frames into 512, speeding up the playback. The problem was that for a ratio of, say, 0.7, framesPerBuffer divided by the ratio is not a whole number, and the fractional frames are lost to integer truncation, leaving missing samples at the end of each block unless framesPerBuffer is evenly divisible by the SRC ratio. So I read 2 extra input frames whenever that division has a remainder, and it fixed 99% of the glitches.
/* This if statement ensures smooth varispeed output */
if (fmod((double)framesPerBuffer, data->src_data.src_ratio) == 0)
{
    numInFrames = framesPerBuffer;
}
else
{
    numInFrames = (framesPerBuffer / data->src_data.src_ratio) + 2;
}
/* Read numInFrames frames from inFile into the SRC input buffer */
numberOfFrames = sf_readf_float(data->inFile, data->src_inBuffer, numInFrames);
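As a worked example of that arithmetic (the numbers are illustrative, not from the original post):

/* framesPerBuffer = 1024, src_ratio = 0.7:
 * fmod(1024.0, 0.7) != 0, so
 * numInFrames = (1024 / 0.7) + 2 = 1462.857... + 2, truncated to 1464.
 * Reading 1464 input frames gives src_process enough data to produce
 * the full 1024 output frames; 1462 input frames would yield only
 * floor(1462 * 0.7) = 1023 output frames, one short of 1024, leaving
 * a small gap at the end of every block. */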
