Erratic timing results from mach_absolute_time() - ios

I'm trying to optimize a function (an FFT) on iOS, and I've set up a test program to time its execution over several hundred calls. I'm using mach_absolute_time() before and after the function call to time it. I'm doing the tests on an iPod touch 4th generation running iOS 6.
Most of the timing results are roughly consistent with each other, but occasionally one run will take much longer than the others (as much as 100x longer).
I'm pretty certain this has nothing to do with my actual function. Each run has the same input data, and is a purely numerical calculation (i.e. there are no system calls or memory allocations). I can also reproduce this if I replace the FFT with an otherwise empty for loop.
Has anyone else noticed anything like this?
My current guess is that my app's thread is somehow being interrupted by the OS. If so, is there any way to prevent this from happening? (This is not an app that will be released on the App Store, so non-public APIs would be OK for this.)
I no longer have an iOS 5.x device, but I'm pretty sure this was not happening prior to the update to iOS 6.
EDIT:
Here's a simpler way to reproduce:
for (int i = 0; i < 1000; ++i)
{
uint64_t start = mach_absolute_time();
for (int j = 0; j < 1000000; ++j);
uint64_t stop = mach_absolute_time();
printf("%llu\n", stop-start);
}
Compile this in debug (so the for loop is not optimized away) and run; most of the values are around 220000, but occasionally a value is 10 times larger or more.

In my experience, mach_absolute_time is not reliable, so I now use CFAbsoluteTimeGetCurrent instead. It returns the current time in seconds as a double, with sub-second precision.
const CFAbsoluteTime newTime = CFAbsoluteTimeGetCurrent();

mach_absolute_time() is actually very low level and reliable. It runs at a steady 24MHz on all iOS devices, from the 3GS to the iPad 4th gen. It's also the fastest way to get timing information, taking between 0.5µs and 2µs depending on CPU. But if you get interrupted by another thread, of course you're going to get spurious results.
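To relate raw tick counts like the ~220000 in the question to wall-clock time, the ticks can be scaled by the timebase. The following is a minimal sketch, assuming the 24MHz rate stated above (which corresponds to 125/3 ns per tick). `ticks_to_ns` and `elapsed_ns` are hypothetical helper names, and on a real device the numerator and denominator should be read from `mach_timebase_info()` rather than hard-coded.

```c
#include <stdint.h>

#ifdef __APPLE__
#include <mach/mach_time.h>
#endif

/* Hypothetical helper: convert a tick delta to nanoseconds using a
   numer/denom timebase (ns per tick = numer / denom). */
uint64_t ticks_to_ns(uint64_t ticks, uint32_t numer, uint32_t denom)
{
    /* Multiply first to keep precision. Fine for short intervals,
       but this can overflow for very large tick counts. */
    return ticks * numer / denom;
}

#ifdef __APPLE__
/* On iOS/macOS, query the real timebase instead of assuming 24MHz. */
uint64_t elapsed_ns(uint64_t start, uint64_t stop)
{
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);
    return ticks_to_ns(stop - start, tb.numer, tb.denom);
}
#endif
```

Under the assumed 125/3 timebase, `ticks_to_ns(220000, 125, 3)` comes to 9166666 ns, so a typical iteration in the question takes roughly 9.2 ms.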
SCHED_FIFO with maximum priority will allow you to hog the CPU, but only for a few seconds at most, then the OS decides you're being too greedy. You might want to try sleep( 5 ) before running your timing test, as this will build up some "credit".
You don't actually need to start a new thread, you can temporarily change the priority of the current thread with this:
struct sched_param sched;
sched.sched_priority = 62;
pthread_setschedparam( pthread_self(), SCHED_FIFO, &sched );
Note that sched_get_priority_min & max return a conservative 15 & 47, but this only corresponds to an absolute priority of about 0.25 to 0.75. The actual usable range is 0 to 62, which corresponds to 0.0 to 1.0.

This happens when the app spends some of that time in other threads.
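Because occasional preemption inflates individual samples, one common way to make such a benchmark robust (an addition here, not something the answers above propose) is to report the median or minimum of many runs instead of any single measurement. A minimal sketch, where `median_ticks` is a hypothetical helper:

```c
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort over uint64_t values. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place and returns the median
   (the upper middle element when n is even). n must be at least 1. */
uint64_t median_ticks(uint64_t *samples, size_t n)
{
    qsort(samples, n, sizeof samples[0], cmp_u64);
    return samples[n / 2];
}
```

A single run that was interrupted (for example a 2200000 among values near 220000) then has no effect on the reported figure.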


DXGI Waitable SwapChain not waiting (D3D11)

There is another question with the same title on the site, but it didn't solve my problem.
I'm writing a Direct3D 11 desktop application, and I'm trying to implement the waitable swap chain introduced by this document to reduce latency (specifically, the latency between when the user moves the mouse and when the monitor displays the change).
The problem is that I call WaitForSingleObject on the handle returned by GetFrameLatencyWaitableObject, but it does not wait at all and returns immediately (which results in my application running at about 200 to 1000 fps on a 60Hz monitor). So my questions are:
Did I even understand correctly what a waitable swap chain does? As I understand it, it is very similar to VSync (which is enabled by passing 1 for the SyncInterval param when calling Present on the swap chain), except that instead of waiting for a previous frame to finish presenting at the end of the render loop (which is when we call Present), we wait at the start of the render loop (by calling WaitForSingleObject on the waitable object).
If I understood correctly, then what am I missing? Or does this only work for UWP applications (since that document and its sample project are UWP)?
here's my code to create swap chain:
SwapChainDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
SwapChainDesc.Stereo = false;
SwapChainDesc.SampleDesc.Count = 1;
SwapChainDesc.SampleDesc.Quality = 0;
SwapChainDesc.BufferUsage = D3D11_BIND_RENDER_TARGET;
SwapChainDesc.BufferCount = 2;
SwapChainDesc.Scaling = DXGI_SCALING_STRETCH;
SwapChainDesc.SwapEffect = DXGI_SWAP_EFFECT_FLIP_DISCARD;
SwapChainDesc.AlphaMode = DXGI_ALPHA_MODE_UNSPECIFIED;
SwapChainDesc.Flags = DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT;
result = Factory2->CreateSwapChainForHwnd(Device.Get(), hWnd, &SwapChainDesc, &FullscreenDesc, nullptr, &SwapChain1);
if (FAILED(result)) return result;
here's my code to get waitable object:
result = SwapChain2->SetMaximumFrameLatency(1); // also tried setting it to "2"
if (FAILED(result)) return result;
WaitableObject = SwapChain2->GetFrameLatencyWaitableObject(); // also, I never call ResizeBuffers
if (WaitableObject == NULL) return E_FAIL;
and here's my code for render loop:
while (Running) {
if (WaitForSingleObject(WaitableObject, 1000) == WAIT_OBJECT_0) {
Render();
HRESULT result = SwapChain->Present(0, 0);
if (FAILED(result)) return result;
}
}
So I took some time to download and test the official sample, now I think I'm ready to answer my own questions:
No, a waitable swap chain does not work the way I thought: it does not wait until a previous frame has been presented on the monitor. Instead, I think it waits until all the work before Present is either finished (the GPU has finished rendering to the render target but has not yet displayed it on the monitor) or queued (the CPU has finished sending the GPU all the commands, but the GPU hasn't finished executing them yet). I'm not sure which is the real case, but either one would, in theory, help reduce input latency (and according to my tests it did, with VSync both on and off). Now that I know it has almost nothing to do with framerate control, I also know it shouldn't be compared with VSync.
I don't think it's limited to UWP.
And now I'd like to share some ideas that I have concluded for myself about input latency and framerate control:
I now believe that the concept of reducing input latency and the concept of framerate control are mutually exclusive and that a perfect balance point between them probably doesn't exist;
For example, if I limit the framerate to 1 frame per vblank, then the input latency (in an ideal scenario) would be as high as the monitor's frame time (about 16ms for a 60Hz monitor).
But when I don't limit the framerate, the input latency would be as high as the time the GPU takes to finish a frame (in an ideal scenario about 1 or 2ms, which is not only faster on paper but visibly better from the user's perspective), though a lot of frames (and the CPU/GPU resources used to render them) would be wasted.
As an FPS game player myself, the reason I want to reduce input latency is obvious: I hate input lag.
And the reasons I want to invest in framerate control are: firstly, I hate frame tearing (a little more than I hate input lag); secondly, I want to ease CPU/GPU usage when possible.
However, recently I discovered that frame tearing is perfectly defeated by using flip model (I just don't get any tearing at all when using flip model, no VSync needed), so I don't need to worry about tearing anymore.
So I now plan to prioritize latency reduction rather than framerate control until when and if one day I move on to D3D12 to figure out a way to ease CPU/GPU usage while preserving low input latency.

(When) Does CACurrentMediaTime/mach_system_time wrap around on iOS?

To get accurate time measurements on iOS, mach_absolute_time() should be used. Or CACurrentMediaTime(), which is based on mach_absolute_time(). This is documented in this Apple Q&A, and also explained in several StackOverflow answers (e.g. https://stackoverflow.com/a/17986909, https://stackoverflow.com/a/30363702).
When does the value returned by mach_absolute_time() wrap around? When does the value returned by CACurrentMediaTime() wrap around? Does this happen in any realistic timespan? The return value of mach_absolute_time() is of type uint64, but I'm unsure about how this maps to a real timespan.
The document you reference notes that mach_absolute_time is CPU dependent, so there is no single figure for how much time must elapse before it wraps. On the simulator, mach_absolute_time is in nanoseconds, so if it wraps at UInt64.max, that translates to about 585 years. On my iPhone 7+, it's 24,000,000 mach_absolute_time ticks per second, which translates to about 24 thousand years. Bottom line, the theoretical maximum amount of time captured by mach_absolute_time will vary based upon CPU, but you won't ever encounter this in any practical application.
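The arithmetic behind those two figures can be reproduced with a short sketch. It assumes the counter really does wrap at the maximum 64-bit value and uses a 365.25-day year; `years_until_wrap` is a hypothetical helper name.

```c
#include <stdint.h>

/* Years until a 64-bit tick counter wraps, given its rate in ticks
   per second. Uses a 365.25-day year. */
double years_until_wrap(double ticks_per_second)
{
    const double seconds_per_year = 365.25 * 24.0 * 60.0 * 60.0;
    return (double)UINT64_MAX / ticks_per_second / seconds_per_year;
}
```

Calling it with 1e9 (nanosecond ticks) gives a little under 585 years, and with 24e6 (24MHz ticks) a little over 24 thousand years, matching the estimates above.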
For what it's worth, consistent with those various posts you found, the CFAbsoluteTimeGetCurrent documentation warns that:
Repeated calls to this function do not guarantee monotonically increasing results. The system time may decrease due to synchronization with external time references or due to an explicit user change of the clock.
So, you definitely don't want to use NSDate/Date or CFAbsoluteTimeGetCurrent if you want accurate elapsed times. Neither ensures monotonically increasing values.
In short, when I need that sort of behavior, I generally use CACurrentMediaTime, because it enjoys the benefits of mach_absolute_time but converts the result to seconds for me, which makes it very simple to use. And neither it nor mach_absolute_time is going to wrap in any realistic time period.

Calculate power consumption when the CPU doesn't change to LPM mode in Contiki

I need to calculate the power consumption of the CPU according to this formula:
Power(mW) = cpu * 1.8 / time
where time is the sum of cpu + lpm.
I need to measure at the start of a certain process and at the end. However, the time elapsed is too short, and the CPU doesn't change to LPM mode, as seen in the following values taken with powertrace_print():
all_cpu   all_lpm   all_transmit   all_listen
116443    1514881   148            1531616
17268     1514881   148            1532440
Calculating power consumption of cpu I got 1.8 mW (which is exactly the value of current draw of CPU in active mode).
My question is: how do I calculate power consumption in this case?
If the MCU does not go into a low-power mode (LPM), then it spends all the time in active mode, so the result of 1.8 mW you get looks correct.
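The question's formula can be written as a small function, which makes it easy to see why zero LPM time always yields 1.8 mW. This is a sketch of that formula only: the 1.8 factor is the active-mode figure quoted in the question, LPM draw is treated as zero as in the original formula, and `cpu_power_mw` is a hypothetical name.

```c
/* Average CPU power in mW from powertrace tick counts, using the
   question's formula: cpu * 1.8 / (cpu + lpm). */
double cpu_power_mw(unsigned long cpu_ticks, unsigned long lpm_ticks)
{
    unsigned long total = cpu_ticks + lpm_ticks;
    if (total == 0)
        return 0.0; /* no time elapsed, nothing to average */
    return (double)cpu_ticks * 1.8 / (double)total;
}
```

When `lpm_ticks` is 0 the ratio is 1 and the result is 1.8 mW regardless of how many CPU ticks elapsed. With, say, 100 CPU ticks and 300 LPM ticks the average drops to 0.45 mW.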
Perhaps you want to ask something different? If you want to measure the time required to execute a specific block of code, you can add RTIMER_NOW() calls at the start and end of the block.
The time resolution of RTIMER_NOW may be too coarse for short-time operations. You can use a higher frequency timer for that, depending on your platform, e.g. read the TBR register for timing if you're compiling for a msp430 based sensor node.
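When differencing two readings of a narrow free-running timer, the counter may wrap between the readings. The sketch below assumes a 16-bit counter (as is typical for msp430 timers, but check your platform's `rtimer_clock_t`), and the tick rate is passed in as a parameter because it is platform-specific. Both helper names are hypothetical.

```c
#include <stdint.h>

/* Elapsed ticks between two readings of a free-running 16-bit timer.
   Unsigned subtraction handles a single wraparound correctly. */
uint16_t elapsed_ticks16(uint16_t start, uint16_t end)
{
    return (uint16_t)(end - start);
}

/* Convert ticks to microseconds given the timer rate in ticks per second. */
double ticks_to_us(uint16_t ticks, uint32_t ticks_per_second)
{
    return 1e6 * (double)ticks / (double)ticks_per_second;
}
```

The measured block must finish before the counter wraps a second time, otherwise the difference is ambiguous.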

Problem with the timings of a program that uses 1-8 threads on a server that has 4 Dual Core Cpu's?

I am running a program on a server at my university that has 4 Dual-Core AMD Opteron(tm) Processor 2210 HE and the O.S. is Linux version 2.6.27.25-78.2.56.fc9.x86_64. My program implements Conway's Game of Life and it runs using pthreads and openmp. I timed the parallel part of the program with the gettimeofday() function using 1-8 threads. But the timings don't seem right. I get the biggest time using 1 thread (as expected), then the time gets smaller. But the smallest time I get is when I use 4 threads.
Here is an example when I use an array 1000x1000.
Using 1 thread~9,62 sec, Using 2 Threads~4,73 sec, Using 3 ~ 3.64 sec, Using 4~2.99 sec, Using 5 ~4,19 sec, Using 6~3.84, Using 7~3.34, Using 8~3.12.
The above timings are when I use pthreads. When I use openmp the timing are smaller but follow the same pattern.
I expected that the time would decrease from 1 to 8 threads because of the 4 dual-core CPUs. I thought that because there are 4 CPUs with 2 cores each, 8 threads could run at the same time. Does it have to do with the operating system that the server runs?
I also tested the same programs on another server that has 7 Dual-Core AMD Opteron(tm) Processor 8214 and runs Linux version 2.6.18-194.3.1.el5. There the timings I get are what I expected. The timings get smaller starting from 1 (the biggest) to 8 (smallest execution time).
The program implements the Game of Life correctly, both using pthreads and openmp. I just can't figure out why the timings are like the example I posted. So in conclusion, my questions are:
1) Does the number of threads that can run at the same time on a system depend on the number of cores? Does it depend only on the number of CPUs, even though each CPU has more than one core? Or does it depend on all of the above plus the operating system?
2) Does it have to do with the way I divide the 1000x1000 array among the threads? But if it did, then the openmp code wouldn't give the same pattern of timings, would it?
3) What is the reason I might get such timings?
This is the code I use with openmp:
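#include <stdio.h>
#include <sys/time.h>
#include <omp.h>
#define Row (1000+2) // parenthesized so the macros expand safely inside expressions
#define Col (1000+2)
int num;
int (*temp)[Col];
int (*a1)[Col];
int (*a2)[Col];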
int main() {
int i,j,l,sum;
int array1[Row][Col],array2[Row][Col];
struct timeval tim;
struct tm *tm;
double start,end;
int st,en;
for (i=0; i<Row; i++)
for (j=0; j<Col; j++)
{
array1[i][j]=0;
array2[i][j]=0;
}
array1[3][16]=1;
array1[4][16]=1;
array1[5][15]=1;
array1[6][15]=1;
array1[6][16]=1;
array1[7][16]=1;
array1[5][14]=1;
array1[4][15]=1;
a1=array1;
a2=array2;
printf ("\nGive number of threads:");
scanf("%d",&num);
gettimeofday(&tim,NULL);
start=tim.tv_sec+(tim.tv_usec/1000000.0);
omp_set_num_threads(num);
#pragma omp parallel private(l,i,j,sum)
{
printf("Number of Threads:%d\n",omp_get_num_threads());
for (l=0; l<100; l++)
{
#pragma omp for
for (i=1; i<(Row-1); i++)
{
for (j=1; j<(Col-1); j++)
{
sum=a1[i-1][j-1]+a1[i-1][j]+a1[i-1][j+1]+a1[i][j-1]+a1[i][j+1]+a1[i+1][j-1]+a1[i+1][j]+a1[i+1][j+1];
if ((a1[i][j]==1) && (sum==2||sum==3))
a2[i][j]=1;
else if ((a1[i][j]==1) && (sum<2))
a2[i][j]=0;
else if ((a1[i][j]==1) && (sum>3))
a2[i][j]=0;
else if ((a1[i][j]==0 )&& (sum==3))
a2[i][j]=1;
else if (a1[i][j]==0)
a2[i][j]=0;
}//end of iteration J
}//end of iteration I
#pragma omp barrier
#pragma omp single
{
temp=a1;
a1=a2;
a2=temp;
}
#pragma omp barrier
}//end of iteration L
}//end of parallel region
gettimeofday(&tim,NULL);
end=tim.tv_sec+(tim.tv_usec/1000000.0);
printf("\nTime Elapsed:%.6lf\n",end-start);
printf("all ok\n");
return 0;
}
TIMINGS with openmp code
a)System with 7 Dual Core Cpus
Using 1 thread~7,72 sec, Using 2 threads~4,53 sec, Using 3 Threads~3,64 sec, Using 4 threads~ 2,24 sec, Using 5~2,02 sec, Using 6~ 1,78 sec, Using 7 ~1,59 sec,Using 8 ~ 1,44 sec
b)System with 4 Dual Core Cpus
Using 1 thread~9,06 sec, Using 2 threads~4,86 sec, Using 3 Threads~3,49 sec, Using 4 threads~ 2,61 sec, Using 5~3,98 sec, Using 6~ 3,53 sec, Using 7 ~3,48 sec,Using 8 ~ 3,32 sec
Above are the timings I get.
One thing you have to remember is that you're doing this on a shared memory architecture. The more loads/stores you try to do in parallel, the greater the chance of hitting contention on memory access, which is a relatively slow operation. So, in my experience, typical applications don't benefit from more than 6 cores. (This is anecdotal; I could go into a lot of detail, but I don't feel like typing. Suffice it to say, take these numbers with a grain of salt.)
Try instead to minimize access to shared resources if possible, see what that does to your performance. Otherwise, optimize for what you got, and remember this:
Throwing more cores at a problem does not mean it will go quicker. Like with taxation, there's a curve, and past some point the number of cores starts becoming a detriment to getting the most performance out of your program. Find that "sweet spot", and use it.
You write:
"The above timings are when i use pthreads. When i use openmp the timing are smaller but follow the same pattern."
Congratulations, you have discovered the pattern which all parallel programs follow ! If you plot execution time against number of processors the curve eventually flattens out and starts to rise; you reach a point where adding more processors slows things down.
The interesting question is how many processors you can profitably use, and the answer to this is dependent on many factors. @jer has pointed out some of the factors which affect the scalability of programs on shared-memory computers. Other factors, principally the ratio of communication to computation, ensure that the shape of the performance curve will be the same on distributed-memory computers too.
The other factor which is important when measuring the parallel scalability of your program is the problem size(s) you use. How does your performance curve change when you try a grid of 1414 x 1414 cells ? I would expect that the curve will be below the curve for the problem on 1000 x 1000 cells and will flatten out later.
For further reading Google for Amdahl's Law and Gustafson's Law.
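To put numbers on the curve, the posted timings can be converted to speedup and parallel efficiency, and compared against the bound given by Amdahl's Law. A minimal sketch follows. The function names are hypothetical, and the parallel fraction `p` passed to `amdahl_speedup` is something you would have to estimate for your own program.

```c
/* Measured speedup: single-thread time divided by n-thread time. */
double speedup(double t1, double tn)
{
    return t1 / tn;
}

/* Parallel efficiency: speedup per thread (1.0 would be ideal scaling). */
double efficiency(double t1, double tn, int n)
{
    return speedup(t1, tn) / n;
}

/* Amdahl's Law: upper bound on speedup with n threads when a fraction p
   of the runtime can be parallelized. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```

With the pthreads timings from the question, `speedup(9.62, 2.99)` is about 3.22 on 4 threads (efficiency about 0.80), while `speedup(9.62, 3.12)` is about 3.08 on 8 threads (efficiency about 0.39).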
Could be your sysadmin is controlling how many threads you can execute simultaneously or how many cores you run on. I don't know if it is possible at the sysadmin level, but it sure is possible to tell a process that.
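Before looking for limits, it is worth checking how many processors the OS actually reports. If the "4 Dual-Core" figure came from counting entries in /proc/cpuinfo (an assumption about how it was obtained), note that each entry there is one core, which would mean 4 cores in total rather than 8. A minimal sketch using `sysconf`; the `_SC_NPROCESSORS_*` names are widely available on Linux but are not part of the POSIX standard, and the wrapper names are hypothetical.

```c
#include <unistd.h>

/* Processors currently online, as reported by the OS (-1 if unsupported). */
long online_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* Processors configured in the system, whether online or not. */
long configured_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_CONF);
}
```

If `online_cpus()` returns 4 on the first server, the minimum at 4 threads is exactly what one would expect.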
Or, your algorithm could be using L2 cache poorly. Hyper-threading or whatever they call it now works best when one thread is doing something that takes a long time and the other thread is not. Accessing memory not in L2 cache is SLOW and the thread doing so will stall while it waits. This is just one example of where the time to run multiple threads on a single core comes from. A Quad core memory bus might allow each core to access some of the ram at the same time, but not each thread in each core. If both threads go for RAM then they basically are running sequentially. So that could be where your 4 comes from.
You might look to see if you can change your loops so they operate on contiguous RAM. If you break the problem into small blocks of data that fit in your L2 cache and iterate through those blocks, you might get 8x. If you search for the intel machine language programmers guides for their latest processors...they talk about these issues.
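The block-wise traversal described above can be sketched as follows. This only illustrates the loop structure on a simple sum; `sum_tiled` and the tile size are hypothetical, and since the posted loops already walk each row contiguously, whether tiling helps the Game of Life kernel is something to measure rather than assume.

```c
/* Sum a rows x cols row-major grid tile by tile, so that each tile's
   working set stays small. tile must be positive. */
long sum_tiled(const int *grid, int rows, int cols, int tile)
{
    long total = 0;
    for (int bi = 0; bi < rows; bi += tile) {
        for (int bj = 0; bj < cols; bj += tile) {
            /* Clamp the tile to the grid edges. */
            int i_end = (bi + tile < rows) ? bi + tile : rows;
            int j_end = (bj + tile < cols) ? bj + tile : cols;
            for (int i = bi; i < i_end; i++)
                for (int j = bj; j < j_end; j++)
                    total += grid[i * cols + j];
        }
    }
    return total;
}
```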

how to get page size

I was asked this question in an interview. Please tell me the answer:
You have no documentation of the kernel. You only know that your kernel supports paging.
How will you find the page size? There is no flag or macro that can tell you the page size.
I was given the hint that you can use time to get the answer. I still have no clue.
Run code like the following:
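// maxpossiblepagesize, searchgranularity, veryverybigsize and
// getcurrentveryaccuratetime() (assumed to return seconds as a double) are placeholders
for (size_t stride = 1; stride < maxpossiblepagesize; stride += searchgranularity) {
char* somemem = (char*)malloc(veryverybigsize*stride);
if (somemem == NULL) break; // allocation failed, stop the search
double starttime = getcurrentveryaccuratetime();
for (char* pos = somemem; pos < somemem+veryverybigsize*stride; pos += stride) {
// iterate over "veryverybigsize" chunks of size "stride"
*pos = 'Q'; // Just write something to force the page into physical memory
}
double endtime = getcurrentveryaccuratetime();
printf("stride %zu, runtime %f\n", stride, endtime-starttime);
free(somemem); // release the block before trying the next stride
}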
Graph the results with stride on the X axis and runtime on the Y axis. There should be a point at stride=pagesize, where the performance no longer drops.
This works by incurring a number of page faults. Once stride surpasses pagesize, the number of faults ceases to increase, so the program's performance no longer degrades noticeably.
If you want to be cleverer, you could exploit the fact that the mprotect system call must work on whole pages. Try it with something smaller, and you'll get an error. I'm sure there are other "holes" like that, too - but the code above will work on any system which supports paging and where disk access is much more expensive than RAM access. That would be every seminormal modern system.
It looks to me like a question about 'how does paging actually work'
They want you to explain the impact that changing the page size will have on the execution of the system.
I am a bit rusty on this stuff, but when a page is full, the system starts page swapping, which slows everything down. So you want to run something that will fill up the memory to different sizes, and measure the time it takes to do a task. At some point there will be a jump, where the time taken to do the task will suddenly jump.
Like I said I am a bit rusty on the implementation of doing this. But i'm pretty sure that is the shape of the answer they were after.
Whatever answer they were expecting, it would almost certainly be a brittle solution. For one thing, you can have multiple page sizes, so any answer you may have gotten for one small allocation may be irrelevant for the next multi-megabyte allocation (see things like Linux's Large Page support).
I suspect the question was more aimed at seeing how you approached the problem rather than the final solution you came up with.
By the way this question isn't about linux because you do have documentation for that as well as POSIX compliance, for which you just call sysconf(_SC_PAGE_SIZE).
