Profiling with an external counter or best alternative - opencv

I'm a non-programmer trying to assess the time spent in (OpenCV) functions. We have an AD converter that comes with a counter able to count external signals (e.g. from a function generator) at a frequency of 1 MHz, i.e. with 1 µs resolution. The current counter value can be queried with a function cbIn32(..., unsigned long *pointertovalue).
So my idea was to query the counter value before and after calling the function of interest and then calculate the difference. However, doubts came up when I calculated the difference without a function call in between, which revealed relatively high fluctuations (values between roughly 80 and 400 µs). I wondered whether calculating the average time for calling cbIn32() (approx. 180 µs) and subtracting it from the putative time spent in the function of interest is a valid solution.
So my first two questions:
Is that approach generally feasible or useless?
Where do the fluctuations come from?
Alternatively, we tried using getTickCount(), which seemed to deliver reasonable values. But checking forums revealed that it has a low resolution of about 10 ms, which would be unsatisfactory (a resolution of 100 µs would be appreciated). However, the values we got were in the sub-ms range.
This brings me to the next questions:
How can the time assessed for a function with getTickCount() be in the microseconds range, when the resolution is around 10 ms?
Should I trust the obtained values or not?
I also tried it with gprof, but it gave me "no time accumulated", although I am sure that the time spent in a function containing opencv-related calls is at least a few milliseconds. I even tried rebuilding opencv with ENABLE_PROFILING=ON, but same result. I read somewhere that you need to build static opencv libraries to enable profiling, but I am not sure if this would improve the situation. So the question here is:
What do I have to do so that gprof also "sees" opencv functions?
The next alternative would be the QueryPerformanceCounter() function of the WinAPI. I don't know how to use it, but I would fight my way through if you recommend it. My questions about that approach:
Will it be problematic because of multiple cores?
If yes, is there an "easy" way to handle that problem?
I also tried it with Very Sleepy, but it somehow exits too early (it worked fine with other .exe files).
Newbie-friendly answers would be very, very appreciated. My goal is to find the easiest approach with highest precision. I'm working on Win7 64bit, Eclipse with MinGW.
Thx for your help...
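The before/after measurement described in this question, including the idea of subtracting the cost of reading the counter itself, can be sketched as follows. This is only an illustration in Python, not the cbIn32() setup from the question: time.perf_counter_ns() stands in for the counter (on Windows, CPython implements it on top of QueryPerformanceCounter), and work() is a hypothetical placeholder for the function of interest. The median is used instead of the mean so that occasional outliers, such as the 400 µs readings mentioned above, do not dominate the estimate.

```python
import statistics
import time


def timer_overhead_ns(samples=1000):
    """Read the timer twice in a row, with nothing in between, many times."""
    deltas = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        end = time.perf_counter_ns()
        deltas.append(end - start)
    return deltas


def time_call_ns(func, repeats=200):
    """Return the duration of each individual call to func() in nanoseconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        func()
        end = time.perf_counter_ns()
        durations.append(end - start)
    return durations


def work():
    # Hypothetical stand-in for the function being profiled.
    return sum(i * i for i in range(10_000))


overhead = timer_overhead_ns()
durations = time_call_ns(work)

baseline = statistics.median(overhead)
corrected = statistics.median(durations) - baseline

print("timer overhead (median, ns):", baseline)
print("timer overhead (min..max, ns):", min(overhead), "..", max(overhead))
print("work() (median minus overhead, ns):", corrected)
```

Printing the minimum and maximum of the empty measurement shows how much the timer read itself fluctuates. If that spread is of the same order as the function being measured, subtracting an average will not give a trustworthy number, and timing many calls in one batch is usually the safer option.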


Odd Metal shader profiler results with atomic functions

I've implemented a minimal test compute shader in order to get a feel for the performance of Metal atomic functions, specifically atomic_fetch_add_explicit and atomic_compare_exchange_weak_explicit using atomic_uint in the device address space.
Running a compute kernel which calls atomic_fetch_add_explicit once per thread, across 307200 threads, takes around 50 µs on an iPhone 12 Pro. Calling atomic_fetch_add_explicit 10 times per thread instead takes about 650 µs, which makes sense. However, I'm confused by the Xcode shader profiler's line-by-line performance metrics:
Blue, red and yellow mean arithmetic, synchronisation and control flow, respectively. According to Apple's documentation, this result means the atomic function calls each take zero percent of the function's total elapsed time:
The statistics for lines in the function body indicate the time as a percent of the function's total elapsed time.
However, that clearly isn't the case, as my time expenditure is multiplied roughly ten-fold as I go from one to ten calls per thread.
Is this an Xcode bug, or am I missing something here? Answers appreciated, answers with sources doubly appreciated. :)

Prometheus increase not handling process restarts

I am trying to figure out the behavior of Prometheus' increase() querying function with process restarts.
When there is a process restart within a 2m interval and I query:
sum(increase(lcm_restarts[2m]))
I get a value less than expected.
For example, in a simple experiment I mock:
3 lcm_restarts
1 process restart
2 lcm_restarts
All within a 2 minute interval.
Upon querying:
sum(increase(lcm_restarts[2m]))
I receive a value of ~4.5 when I am expecting 5.
lcm_restarts graph
sum(increase(lcm_restarts[2m])) result
Could someone please explain?
Pretty concise and well-prepared first question here. Please keep this spirit!
When working with counters, functions such as rate(), irate() and increase() adjust for resets caused by restarts. Contrary to what the name suggests, the increase() function does not calculate the absolute increase in the given time frame; it is just another way to write rate(metric[interval]) * number_of_seconds_in_interval. The rate() function takes the first and the last measurement in the interval, calculates the per-second increase between them, and extrapolates that to the boundaries of the interval. This is why you may observe non-integer increases even if you always increment by whole numbers: the measurements are almost never exactly at the start and end of the interval.
For more details, please have a look at the Prometheus docs for the increase() function. There are also some good hints on what to do and what not to do when working with counters on the Robust Perception blog.
Having a look at your label dimensions, I also think that counter resets don't apply to your constructed example. There is one label called reason that changed between the restarts and so created a second time series (rather than continuing the existing one). Here you are basically summing up the increases of two different time series, each of which has its own extrapolation applied.
So basically there isn't really anything wrong with what you are doing; you just shouldn't rely on getting highly precise numbers out of Prometheus for your use case.
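To make the extrapolation effect concrete, here is a deliberately simplified model in Python. It is not Prometheus' actual implementation: it ignores the limits Prometheus places on how far it extrapolates, and the sample timestamps and values are made up for illustration. It only shows why an integer counter can yield a fractional increase once the sampled span is stretched to the full window.

```python
def raw_increase(samples):
    """Sum the increases between consecutive samples.

    samples is a list of (timestamp_seconds, counter_value) pairs.
    A drop in value is treated as a counter reset to zero.
    """
    total = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            total += value  # counter restarted from 0
        else:
            total += value - previous
        previous = value
    return total


def extrapolated_increase(samples, window_seconds):
    """Scale the raw increase from the sampled span to the whole window."""
    sampled_span = samples[-1][0] - samples[0][0]
    if len(samples) < 2 or sampled_span <= 0:
        return None
    return raw_increase(samples) * window_seconds / sampled_span


# Hypothetical 15 s scrape interval inside a 120 s window; the first sample
# is 5 s after the window start and the last one 10 s before the window end.
samples = [(5, 0), (20, 1), (35, 3), (50, 3), (65, 0), (80, 1), (95, 2), (110, 2)]

raw = raw_increase(samples)
estimate = extrapolated_increase(samples, window_seconds=120)
print("raw increase:", raw)
print("extrapolated increase:", estimate)
```

The raw increase is a whole number, while the extrapolated value is not, because 105 s of observed data is scaled up to 120 s.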
Prometheus may return unexpected results from the increase() function for the following reasons:
Prometheus may return fractional results from increase() over an integer counter because of extrapolation. See this issue for details.
Prometheus may return lower than expected results from increase(m[d]) because it doesn't take into account possible counter increase between the last raw sample just before the specified lookbehind window [d] and the first raw sample inside the lookbehind window [d]. See this article and this comment for details.
Prometheus skips the increase for the first sample in a time series. For example, increase() over the following series of samples would return 1 instead of 11: 10 11 11. See these docs for details.
These issues are going to be fixed according to this design doc. In the meantime, it is possible to use other Prometheus-like systems such as VictoriaMetrics, which are free from these issues.

Largest amount of entries in lua table

I am trying to build a Sieve of Eratosthenes in Lua and I have tried several things, but I am confronted with the following problem:
Lua's tables are too small for this scenario. If I just want to create a table with all the numbers (see example below), the table is too "small" even with only 1/8 (...) of the number (the number is pretty big, I admit)...
max = 600851475143
numbers = {}
-- runs out of memory long before the loop finishes
for i = 1, max do
  table.insert(numbers, i)
end
If I execute this script on my Windows machine, there is an error message saying: C:\Program Files (x86)\Lua\5.1\lua.exe: not enough memory. I also tried it with Lua 5.3 on my Linux machine, where the process was simply killed. So it is pretty obvious that Lua can't handle this number of entries.
I don't really know whether it is simply impossible to store that many entries in a Lua table or whether there is a simple solution for this (I tried using a long string as well...). And what exactly is the largest number of entries in a Lua table?
Update: And would it be possible to somehow allocate more memory for the table manually?
Update 2 (Solution for second question): The second question is an easy one. I just tested it by increasing the count until the program breaks: 33,554,432 (2^25) entries fit in one one-dimensional table on my 12 GB RAM system. Why 2^25? My first guess was 64 bits per number * 2^25 = 2147483648 bits, but that is 2^31 bits, i.e. 256 MiB rather than 2 GB. The limit therefore more likely comes from the 2 GB address space of the 32-bit Lua for Windows build combined with per-entry overhead and table reallocation, not from the raw numbers alone.
P.S. You may have noticed that this number is from Project Euler Problem 3. Yes, I am trying to solve that. Please don't give specific hints (..). Thank you :)
The Sieve of Eratosthenes only requires one bit per number, representing whether the number has been marked non-prime or not.
One way to reduce memory usage would be to use bitwise math to represent multiple bits in each table entry. Current Lua implementations have intrinsic support for bitwise-or, -and etc. Depending on the underlying implementation, you should be able to represent 32 or 64 bits (number flags) per table entry.
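The bit-packing idea can be sketched as follows. The example is in Python rather than Lua, purely as an illustration; the same shifts and masks can be written with the bitwise operators of Lua 5.3 or a bit library. The function name sieve_bits is made up for this sketch. Each byte of the bytearray holds the flags for eight consecutive numbers.

```python
def sieve_bits(limit):
    """Return all primes <= limit, storing one flag bit per number."""
    # Bit (n & 7) of byte (n >> 3) is set once n is known to be composite.
    composite = bytearray((limit >> 3) + 1)
    primes = []
    for n in range(2, limit + 1):
        if composite[n >> 3] & (1 << (n & 7)):
            continue  # already crossed out
        primes.append(n)
        # Cross out multiples, starting at n*n (smaller ones are done already).
        for m in range(n * n, limit + 1, n):
            composite[m >> 3] |= 1 << (m & 7)
    return primes


print(sieve_bits(30))
print(len(sieve_bits(1_000_000)), "primes below one million")
```

This cuts the flag storage to limit / 8 bytes, which helps for moderate limits but, as discussed in the answers, is still far too much for a limit in the hundreds of billions.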
Another option would be to use one or more very long strings instead of a table. You only need a linear array, which is really what a string is. Just have a long string with "t" or "f", or "0" or "1", at every position.
Caveat: String manipulation in Lua always involves duplication, which rapidly turns into n² or worse complexity in terms of performance. You wouldn't want one continuous string for the whole massive sequence, but you could probably break it up into blocks of a thousand, or of some power of 2. That would reduce your memory usage to 1 byte per number while minimizing the overhead.
Edit: After noticing a point made elsewhere, I realized your maximum number is so large that, even with one bit per number, your memory requirements would be about 75 gigabytes at best, which is extremely impractical. I would recommend following the advice Piglet gave in their answer and looking at Jon Sorenson's version of the sieve, which works on segments of the space instead of the whole thing.
I'll leave my suggestion, as it still might be useful for Sorenson's sieve, but yeah, you have a bigger problem than you realize.
Lua uses double-precision floats to represent numbers. That's 64 bits per number.
600851475143 numbers therefore need about 4.8 terabytes (roughly 4.4 TiB) of memory for the values alone.
So it's not Lua's or its tables' fault. The error message even says
not enough memory
You just don't have enough RAM to allocate that much.
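The memory estimate is simple arithmetic and can be checked in a few lines. The sketch below (Python, used here only as a calculator) counts the raw payload and ignores any per-entry overhead of the table itself, so real usage would be higher.

```python
MAX_NUMBER = 600851475143

# One 64-bit double (8 bytes) per number, as in the original script.
bytes_as_doubles = MAX_NUMBER * 8

# One bit per number, as a bit-packed sieve would need.
bytes_as_bits = MAX_NUMBER // 8 + 1

print("as doubles: %.2f TiB" % (bytes_as_doubles / 2**40))
print("as bits:    %.2f GiB" % (bytes_as_bits / 2**30))
```

Either figure is far beyond the RAM of an ordinary desktop machine, which is why a sieve over the full range cannot work here regardless of the language.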
If you had read the linked Wikipedia article carefully, you would have found the following section:
As Sorenson notes, the problem with the sieve of Eratosthenes is not
the number of operations it performs but rather its memory
requirements.[8] For large n, the range of primes may not fit in
memory; worse, even for moderate n, its cache use is highly
suboptimal. The algorithm walks through the entire array A, exhibiting
almost no locality of reference.
A solution to these problems is offered by segmented sieves, where
only portions of the range are sieved at a time.[9] These have been
known since the 1970s, and work as follows
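A minimal sketch of a segmented sieve, written in Python for illustration and not taken from the quoted article: the function names and the default segment size are assumptions. The base primes up to the square root of the limit are found with an ordinary sieve, and the rest of the range is then processed one fixed-size segment at a time, so only one segment of flags is in memory at any moment.

```python
import math


def simple_sieve(limit):
    """Ordinary sieve of Eratosthenes; returns all primes <= limit."""
    if limit < 2:
        return []
    flags = bytearray([1]) * (limit + 1)
    flags[0] = flags[1] = 0
    for n in range(2, math.isqrt(limit) + 1):
        if flags[n]:
            for m in range(n * n, limit + 1, n):
                flags[m] = 0
    return [i for i, is_prime in enumerate(flags) if is_prime]


def segmented_primes(limit, segment_size=1 << 16):
    """Yield all primes <= limit while holding only one segment of flags."""
    root = math.isqrt(limit)
    base_primes = simple_sieve(root)
    yield from base_primes

    low = root + 1
    while low <= limit:
        high = min(low + segment_size - 1, limit)
        flags = bytearray([1]) * (high - low + 1)
        for p in base_primes:
            # First multiple of p inside [low, high], but never below p*p.
            start = max(p * p, ((low + p - 1) // p) * p)
            for m in range(start, high + 1, p):
                flags[m - low] = 0
        for offset, is_prime in enumerate(flags):
            if is_prime:
                yield low + offset
        low = high + 1


print(list(segmented_primes(100, segment_size=10)))
```

Memory use is governed by the number of base primes plus one segment, not by the size of the full range. Running time still grows with the range, so this addresses the memory problem only.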

(When) Does CACurrentMediaTime/mach_absolute_time wrap around on iOS?

To get accurate time measurements on iOS, mach_absolute_time() should be used, or CACurrentMediaTime(), which is based on mach_absolute_time(). This is documented in this Apple Q&A and also explained in several Stack Overflow answers.
When does the value returned by mach_absolute_time() wrap around? When does the value returned by CACurrentMediaTime() wrap around? Does this happen in any realistic timespan? The return value of mach_absolute_time() is of type uint64, but I'm unsure about how this maps to a real timespan.
The document you reference notes that mach_absolute_time is CPU dependent, so we can't say in general how much time must elapse before it wraps. On the simulator, mach_absolute_time is in nanoseconds, so if it wraps at UInt64.max, that translates to about 585 years. On my iPhone 7+, there are 24,000,000 mach_absolute_time ticks per second, which translates to about 24 thousand years. Bottom line: the theoretical maximum amount of time captured by mach_absolute_time will vary with the CPU, but you won't ever encounter a wraparound in any practical application.
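The two figures above follow from dividing the range of a 64-bit counter by the tick rate. A quick check, written in Python only as a calculator (the helper name years_until_wrap is made up, and a year is taken as 365.25 days):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_until_wrap(ticks_per_second, bits=64):
    """Years before an unsigned counter of the given width overflows."""
    return 2**bits / ticks_per_second / SECONDS_PER_YEAR


# Simulator: one tick per nanosecond.
print("1 GHz tick rate: %.0f years" % years_until_wrap(1_000_000_000))

# Tick rate reported above for an iPhone 7+.
print("24 MHz tick rate: %.0f years" % years_until_wrap(24_000_000))
```

A slower tick rate pushes the wraparound further out, which is why the device figure is so much larger than the simulator figure.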
For what it's worth, consistent with those various posts you found, the CFAbsoluteTimeGetCurrent documentation warns that:
Repeated calls to this function do not guarantee monotonically increasing results. The system time may decrease due to synchronization with external time references or due to an explicit user change of the clock.
So, you definitely don't want to use NSDate/Date or CFAbsoluteTimeGetCurrent if you want accurate elapsed times. Neither ensures monotonically increasing values.
In short, when I need that sort of behavior, I generally use CACurrentMediaTime, because it enjoys the benefits of mach_absolute_time but converts the value to seconds for me, which makes it very simple to use. And neither it nor mach_absolute_time is going to wrap around in any realistic time period.

Which improvements can be made to the Anytime Weighted A* algorithm?

First, for those of you who don't know: an anytime algorithm is an algorithm that takes as input the amount of time it may run and should return the best solution it can find within that time.
Weighted A* is the same as A* with one difference in the f function:
(where g is the path cost up to the node, and h is the heuristic estimate of the remaining cost from the node to a goal)
Original: f(node) = g(node) + h(node)
Weighted: f(node) = (1-w)*g(node) + w*h(node)
My anytime algorithm runs Weighted A* with a decreasing weight, from 1 down to 0.5, until it reaches the time limit.
My problem is that most of the time it takes a long time to reach a solution, and if given something like 10 seconds it usually doesn't find a solution, while other algorithms such as anytime beam search find one in 0.0001 seconds.
Any ideas what to do?
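For reference, the scheme described in the question can be sketched roughly as below. This is a hedged illustration, not the asker's code: it assumes the convex-combination form f = (1 - w) * g + w * h, so that w = 1 is a greedy search and w = 0.5 orders nodes exactly like plain A*. The grid, wall, weight schedule and all function names are invented for the example. It keeps g-values in a dictionary, uses a binary heap for the open list and a set for the closed list, and reopens a node when a cheaper route to it is found.

```python
import heapq
import time


def weighted_a_star(start, goal, neighbors, h, w, deadline):
    """Search on f = (1 - w) * g + w * h; returns (path, cost) or (None, inf)."""
    g = {start: 0}
    parent = {start: None}
    closed = set()
    open_heap = [(w * h(start), start)]
    while open_heap:
        if time.monotonic() > deadline:
            return None, float("inf")  # out of time before reaching the goal
        _, node = heapq.heappop(open_heap)
        if node in closed:
            continue  # stale duplicate heap entry
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], g[goal]
        closed.add(node)
        for nxt, step_cost in neighbors(node):
            new_g = g[node] + step_cost
            if nxt not in g or new_g < g[nxt]:
                g[nxt] = new_g
                parent[nxt] = node
                closed.discard(nxt)  # cheaper route found: allow re-expansion
                heapq.heappush(open_heap, ((1 - w) * new_g + w * h(nxt), nxt))
    return None, float("inf")


def anytime_weighted_a_star(start, goal, neighbors, h, time_limit_s,
                            weights=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Rerun the search with decreasing w and keep the cheapest path found."""
    deadline = time.monotonic() + time_limit_s
    best_path, best_cost = None, float("inf")
    for w in weights:
        path, cost = weighted_a_star(start, goal, neighbors, h, w, deadline)
        if path is None:
            break  # time ran out, or no path exists
        if cost < best_cost:
            best_path, best_cost = path, cost
    return best_path, best_cost


# Hypothetical 5x5 grid with a wall that leaves a gap only at (1, 4).
SIZE = 5
WALL = {(1, 0), (1, 1), (1, 2), (1, 3)}
GOAL = (4, 0)


def grid_neighbors(cell):
    x, y = cell
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= nx < SIZE and 0 <= ny < SIZE and (nx, ny) not in WALL:
            yield (nx, ny), 1


def manhattan(cell):
    return abs(cell[0] - GOAL[0]) + abs(cell[1] - GOAL[1])


path, cost = anytime_weighted_a_star((0, 0), GOAL, grid_neighbors, manhattan,
                                     time_limit_s=5.0)
print(cost, path)
```

Each run restarts from scratch, which is the simplest form of the idea. Anytime variants that reuse the open list between weight changes avoid repeating work, at the cost of more bookkeeping.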
If I were you I'd throw the unbounded heuristic away. Admissible heuristics are much better in that given a weight value for a solution you've found, you can say that it is at most 1/weight times the length of an optimal solution.
A big problem when implementing A* derivatives is the data structures. When I implemented a bidirectional search, just changing from array lists to a combination of hash-augmented priority queues and array lists on demand cut the runtime by three orders of magnitude, literally.
The main problem is that most of the papers only give pseudo-code for the algorithm using set logic - it's up to you to actually figure out how to represent the sets in your code. Don't be afraid of using multiple ADTs for a single list, i.e. your open list. I'm not 100% sure on Anytime Weighted A*, I've done other derivatives such as Anytime Dynamic A* and Anytime Repairing A*, not AWA* though.
Another issue is that when you set the g-value too low, it can sometimes take far longer to find any solution than it would with a higher g-value. A common pitfall is forgetting to check your closed list for duplicate states, thus ending up in a loop (an infinite one if your g-value gets reduced to 0). I'd try starting with something reasonably higher than 0 if you're getting quick results with a beam search.
Some pseudo-code would likely help here! Anyhow these are just my thoughts on the matter, you may have solved it already - if so good on you :)
Beam search is not complete since it prunes unfavorable states whereas A* search is complete. Depending on what problem you are solving, if incompleteness does not prevent you from finding a solution (usually many correct paths exist from origin to destination), then go for Beam search, otherwise, stay with AWA*. However, you can always run both in parallel if there are sufficient hardware resources.
