Super slow image processing on Android tablet

I am trying to implement the SLIC superpixel algorithm on an Android tablet (SLIC).
I ported the C++ code to the Android environment using the NDK, the STL and so on. What the application does is take an image from the camera and send the data to native code for processing.
I got the app running, but the problem is that it takes 20-30 seconds to process a single frame (640 x 400), while on my notebook the same application built with Visual Studio finishes almost instantly!
I checked for memory leaks; there aren't any... Is there anything that might make computation so much more expensive than in VS2010 on my notebook?
I know this question might be very open-ended and not really specific, but I'm really in the dark too. Hope you guys can help.
Thanks
PS. I checked the running time of each step; I think the execution time of every line of code simply went up. I don't see any specific function that takes way longer than usual.
PPS. Do you think any of the following may cause the slowdown?
Memory size: investigated; the GC does not report much pause time during the native calls
STL library: not investigated yet; is it possible that facilities like vector, max and min in the STL cause a significant slowdown?
The Android environment itself?
The lower hardware specification of the Android tablet (Acer Iconia Tab - 1 GHz Nvidia Tegra 250 dual-core processor and 1 GB of RAM)?
Would it be better to run it in Java?
PPPS. If you have time, please check out the code.

I've taken a look at your code and can make the following recommendations:
First of all, you need to add the line APP_ABI := armeabi-v7a to your Application.mk file. Otherwise your code is compiled for the old armv5te architecture, where you have no FPU (all floating-point arithmetic is emulated), fewer registers available, and so on.
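For reference, a minimal Application.mk might look like the following; only the APP_ABI line is required for this particular fix, and the other two lines are assumptions about your setup:
# Application.mk
APP_ABI := armeabi-v7a       # target ARMv7-A so the hardware FPU is used
APP_OPTIM := release         # make sure you are not timing an unoptimized build
APP_STL := stlport_static    # or whichever STL runtime you already link against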
Your SLIC implementation makes intensive use of double floating-point values in its computations. You should replace them with float wherever possible, because hardware support for the double type is weak or absent on this class of ARM core.
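To illustrate the kind of change (the variable names here are invented, not taken from the SLIC sources):
// before: double-precision math is emulated or slow on this hardware
double dist = sqrt(dx * dx + dy * dy);
// after: single-precision stays on the FPU's fast path
// (note sqrtf, and remember to use float literals such as 0.5f)
float dist = sqrtf(dx * dx + dy * dy);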

Related

Enforce use of independent flip mode with DXGI FLIP SwapChain

I currently face a problem with DXGI Swapchains (DirectX 11). My C++ application shows (live) video and my goal is to minimize latency. I have no user input to process.
In order to decrease latency I switched to a DXGI_SWAP_EFFECT_FLIP_DISCARD swapchain (I used BitBlt before - see For best performance, use DXGI flip model for further details). I use the following flags:
//Swapchain Init:
sc.SwapEffect = DXGI_SWAP_EFFECT_FLIP_DISCARD;
sc.Flags = DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT | DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING;
//Present:
m_pDXGISwapChain->Present(0, DXGI_PRESENT_ALLOW_TEARING);
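In case it matters, the frame-latency waitable object requested above would be consumed roughly like this (a sketch; m_pDXGISwapChain2 is assumed to be the same swapchain queried as IDXGISwapChain2):
//Once, after swapchain creation:
m_pDXGISwapChain2->SetMaximumFrameLatency(1);
HANDLE hFrameWait = m_pDXGISwapChain2->GetFrameLatencyWaitableObject();
//Each frame, before rendering:
WaitForSingleObjectEx(hFrameWait, 1000, TRUE); // block until DXGI can accept a new frame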
On one computer the swapchain (windowed) goes into the "Hardware: Independent Flip" mode and I have perfectly low latency, as long as I have no other windows in front. On all other computers I tried, I am stuck in the "Composed: Flip" mode with higher latency (~ 25-30 ms more). The software (binary) is exactly the same. I am checking this with PresentMon.
What I find interesting is that on the computer where the independent flip mode works, it is also active without the ALLOW_TEARING flags - from what I understood, they should be required for it. By the way, I also see tearing in this case, but that is a different problem.
I have already tried comparing Windows 10 versions, graphics drivers and driver settings. The GPU is a Quadro RTX 4000 in all systems. I couldn't spot any difference between the systems.
I would really appreciate any hints on additional preconditions for the independent flip mode I might have missed in the docs. Thanks for your help!
Update 1: I switched the "working" system from Nvidia driver 511.09 to 473.47 (the latest stable release). After that I got the same behavior as on the other systems (no independent flip). After going back to 511.09 it worked again, so the driver seems to have an influence. The other systems also had 511.09 for my original tests, though.
Update 2: After working through all the DirectX debug output, it still does not work as desired. I manage to get into independent flip mode only in true full-screen mode, or in windowed mode when the window has no decorations and actually covers the whole screen. Unfortunately, when using the Graphics Tools for VS I never enter independent flip, so I cannot do further analysis there. But it is interesting that under the Graphics Tools debugger, PresentMon shows Composed Flip, while the Graphics Analyzer from the Graphics Tools shows only DISCARD as the SwapEffect for the swapchain. I would have expected FLIP_DISCARD, as I explicitly used DXGI_SWAP_EFFECT_FLIP_DISCARD.
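For reference, the decoration-free, screen-covering window that did reach independent flip can be produced with a plain Win32 call pair like this (an illustrative sketch; hWnd is assumed to be the video window):
// strip the window decorations and size the window to cover the whole screen
SetWindowLongPtr(hWnd, GWL_STYLE, WS_POPUP | WS_VISIBLE);
SetWindowPos(hWnd, HWND_TOP, 0, 0,
             GetSystemMetrics(SM_CXSCREEN), GetSystemMetrics(SM_CYSCREEN),
             SWP_FRAMECHANGED);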

Alternatives to D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS?

This is a follow-on to this question about using the DX11VideoRenderer sample (a replacement for the EVR that uses DirectX 11 instead of the EVR's DirectX 9).
I've been trying to track down why it uses so much more CPU than the EVR. Task Manager shows me that most of that time is kernel mode.
Using profiling tools, I see that a LOT of time is being spent in numerous calls to NtDelayExecution (aka Sleep). How many calls? ~100,000 over the course of ~12 seconds. Ok, yeah, I'm sending a lot of frames in those 12 seconds, but that's still a lot of calls, every one of which requires a kernel mode transition.
The callstack shows that the last call in "my" code is to IDXGISwapChain1::Present(0, 0). The actual call seems to be Sleep(0) and comes from nvwgf2umx.dll (which is why this question is tagged NVidia: hopefully someone there can call up the code and see what the logic is behind such frequent calls).
I couldn't quite figure out why it would need to do /any/ Sleeping during Present. It's not like we wait for vertical retrace anymore, is it? But the other reason to use Sleep has to do with yielding to other threads. Which led me to a serious clue:
If I use D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS, the CPU utilization drops. Along with some other fixes, the DX11 version is now faster and uses less CPU time than the DX9 version (which is what I would hope/expect). Profiling shows that Sleep has dropped from >30% to <1%.
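For anyone reproducing the experiment, the flag goes into device creation; a minimal sketch with error handling omitted:
UINT flags = D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS;
ID3D11Device* pDevice = nullptr;
ID3D11DeviceContext* pContext = nullptr;
D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, flags,
                  nullptr, 0, D3D11_SDK_VERSION, &pDevice, nullptr, &pContext);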
Unfortunately, this page tells me:
This flag is not recommended for general use.
Oh.
So, any ideas on how to get decent performance without using debug flags?

OpenGL ES apps appear to run MUCH FASTER when profiling in Instruments

I'm scared to ask this question because it doesn't include specifics and doesn't have any code samples, but that's because I've encountered it on three entirely different apps that I've worked on in the past few weeks, and I'm thinking specific code might just cloud the issue.
Scoured the web and found no reference to the phenomenon I'm encountering, so I'm just going to throw this out there and hope someone else has seen the same thing:
The 'problem' is that all the iOS OpenGL apps I've built, to a man, run MUCH FASTER when I'm profiling them in Instruments than when they're running standalone. As in, a frame rate roughly twice as fast (jumping from, e.g., 30fps to 60fps). This is both measured with a code-timing loop and visible just from watching the apps run. Instruments appears to be doing something magical.
This is on a device, not the iOS simulator.
If I profile my OpenGL apps and upload to a device (specifically, an iPad 3 running iOS 5.1) via Instruments, the frame rate is just flat-out much, much faster than when running standalone. There appears to be no frame skipping or shenanigans like that. It simply does the same computation at around twice the speed.
Although I'm not including any code samples, just assume I'm doing the normal stuff. OpenGL ES 2.0, with VBOs and VAOs. Multithreading some computationally intensive code areas with dispatch queues/blocks. Nothing exotic or crazy.
I'd just like to know if anyone has experienced anything vaguely similar. If not, I'll just head back to my burrow and continue stabbing myself in the leg with a fork.
It could be that when you profile, a release build is used by default, whereas when you just hit Run you get a debug build.

Windows to embedded port: data and code memory size

I am in the process of porting a Windows 7 library to an embedded platform. In order to do so, my employer has asked me for the amount of memory (and CPU, but let us concentrate on the memory for now) that my system will need once ported, so he can size the board to my needs.
I had a look on the Internet and there does not seem to be much information on this question, hence my questions:
1. In order to get a rough idea of the memory footprint of the code in flash memory (code only, without memory for data), I read on the Internet that I should sum the sizes of all the DLLs I use. It seems that every compiler and platform gives a different size for the code footprint, but overall the size of the code (without data) is often very close. Can you confirm this?
2. In order to determine the memory required by the data only (heap + stack, but no code), I had a look at the Task Manager (and Process Explorer). It seems the overall amount of data I use is given by the 'peak working set'. I have a few questions about it though:
2.a. Does the 'working set' include the heap + stack memory, or does it correspond to the heap only?
2.b. Does the 'working set' include the size of the code as well (as I am on Windows 7, the code is also stored in RAM, not in flash as on embedded systems), or does it only correspond to the data?
2.c. It seems the 'peak working set' reflects the maximum amount of physical memory actually in RAM since the program started, but not the size the program could reach afterwards (if I happened to allocate memory at runtime - which would be bad ;) - the peak value would keep increasing). Can you confirm this?
2.d. Hence, can you also confirm that if I do not allocate memory at runtime, the 'peak working set' should roughly be the maximum amount of RAM my embedded system will need, give or take some difference due to the differences in system technology?
Thanks,
Antoine.
Unless you are intending to run your application on Windows Embedded, looking at the code and data usage on Windows is not going to be much of an indicator of anything useful!
1) DLLs are libraries - not all the code within them will be used by your code. Most embedded systems are statically linked, and the linker will link in only the modules that are actually referenced by your code. So taking the sum of the DLL dependencies is likely to lead to a gross overestimation of the memory requirement.
2) Windows memory management is profligate with memory - because it can be, and because doing so generally improves the performance of typical desktop systems. For example, a thread stack on Windows is typically on the order of 2MB - you may seldom use that much, but Windows gives it to you anyway because it can, and doing so errs on the side of safety. A thread stack in an embedded system will typically range from a few tens of bytes to a few tens of kilobytes - it depends on your application.
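To make that concrete, on an RTOS the stack is something you state explicitly per task; here is a FreeRTOS-flavoured sketch (the task and its 1KB stack are invented for the example):
#include "FreeRTOS.h"
#include "task.h"

static void workerTask(void *params)
{
    for (;;) { /* do the work */ }
}

void startWorker(void)
{
    // the stack depth is given in words: 256 words = 1KB on a 32-bit ARM,
    // versus the megabyte-scale default reservation for a Windows thread
    xTaskCreate(workerTask, "worker", 256, NULL, tskIDLE_PRIORITY + 1, NULL);
}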
Windows Task Manager shows what Windows allocates to your process; that may not relate to what your process actually needs. Also, your application uses Windows services - all the memory used by kernel and device services will not show up as part of your process, but your embedded system may still need equivalents of those.
If you do use your Windows prototype code to assess the embedded system's requirements, then your best place to start is getting the linker to generate a map file, which will give a detailed description of memory usage in terms of statically allocated data and code size.
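With a GCC-style toolchain that is a single linker switch, for example (the toolchain prefix and file names are illustrative):
arm-none-eabi-gcc main.o mylib.o -Wl,-Map=myapp.map -o myapp.elf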
Code size depends not only on the efficiency of the compiler, but also on the density of the instruction set. Some architectures achieve higher code density than others. Windows application code size is never a good indicator of embedded code size, because the execution environment is likely to be so different. For example, a pre-emptive multitasking RTOS kernel on a 32-bit ARM can be implemented in less than 10KB of code, a file system in perhaps another 10, a network stack in anything from 10 to 30KB, and USB in another 10. As you can see, this is a different world from desktop code.
Data memory usage is perhaps more easily determined, but you do that through analysis of your application rather than by observing what Windows does. There is the data your application instantiates directly, and then there is the data instantiated by the libraries and device drivers you call - on Windows the latter is likely to be relatively large and out of your control. Typical embedded libraries for things such as network stacks, USB, file systems, etc. are far smaller and far more deterministic in both performance and size.
Your better bet is to describe your application in terms of its general purpose, performance requirements, real-time constraints, and hardware requirements (display, networking, I/O, mass storage, etc.), and then look at comparable solutions or at the libraries you will need to implement your solution. Most embedded systems are "bare board" and do not have the services you find in Windows unless you write them or use third-party solutions - Windows is seldom a comparable solution to an embedded system.
If it is just a library rather than an application, then build it for a likely target using a Windows-hosted GCC cross-compiler and see how big it ends up. You don't need hardware for that, or even to spend any money.
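A quick way to read off the result once it builds (again, the toolchain prefix is illustrative):
arm-none-eabi-gcc -Os -c mylib.c -o mylib.o
arm-none-eabi-size mylib.o    # prints the text (code), data and bss sizes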

AIR for iOS: Power saving

What coding tricks, compilation flags, and software-architecture considerations can be applied to keep power consumption low in an AIR for iOS application (or to reduce power consumption in an existing application that burns too much battery)?
One of the biggest things you can do is adjust the frame rate based on the app state.
You can do this by adding handlers inside your App.mxml:
<s:ViewNavigatorApplication xmlns:fx="http://ns.adobe.com/mxml/2009"
xmlns:s="library://ns.adobe.com/flex/spark"
activate="activate()" deactivate="close()" />
Inside your App.mxml script block (with mx.core.FlexGlobals imported), the activate and close methods would look like this:
private function activate():void {
    // app is active again: restore the normal frame rate
    FlexGlobals.topLevelApplication.frameRate = 24;
}
private function close():void {
    // app is deactivated: drop the frame rate to save power
    FlexGlobals.topLevelApplication.frameRate = 2;
}
You can also play around with these numbers depending on what your app is currently doing. If you're just displaying text, try lowering your fps further. This should give you the most bang for your buck in power savings.
Generally, high power consumption can be the result of:
intensive network usage
no sleep mode for display while app idle
unneeded location services usage
continuously high CPU usage
Regarding (Flex/Flash) AIR, I would suggest the following:
First, use the Flex profiler plus the task manager to monitor CPU and memory usage, and try to reduce both as much as possible. Once they are low on Windows/Mac, they should (theoretically) be correspondingly lower on mobile devices.
The next step would be to use a network monitor and reduce the number and size of the network (web service) calls. Try to identify unneeded network activity and eliminate it.
Try to detect any idle state of the app (possible in Flex, not sure about Flash) and put the whole app into an idle mode (if you have a fireworks animation running, just call stop()).
Also, though I am not sure about this, using Stage3D (now available for mobile with AIR 3.2) for complex animations should reduce CPU usage in favor of the GPU. Since hardware acceleration takes over, execution time may go down, and therefore power consumption may be lower.
If I am wrong about something, please comment/downvote (as you like), but this is my personal impression.
Update 1
As prompted in the comments, there is not a 100% correlation between CPU usage on a desktop and on a mobile device, but theoretically, at the low level, we should see at least the same CPU usage trend.
My tips:
Profile your app first with the profiler in Flash Builder
If you have a Mac, profile your app with Instruments from Xcode
Also important: the behaviors of Simulator, IPA-Interpreter packages, and IPA-Test builds differ:
Simulator - pro forma optimizations
IPA-Interpreter - gives you a feeling for the performance
IPA-Test - "real" performance behavior
And finally, test the App Store build; it is the fastest package mode (in terms of performance).
Additionally, we saw that all of these modes can vary.
