Fastest Inverse Square Root on iPhone - ios

I'm working on an iPhone app that involves certain physics calculations that are done thousands of times per second. I am working on optimizing the code to improve the framerate. One of the pieces I am looking at improving is the inverse square root. Right now, I am using the Quake 3 fast inverse square root method. After doing some research, however, I heard that there is a faster way, using the NEON instruction set. I am unfamiliar with inline assembly and cannot figure out how to use NEON. I tried implementing the math-neon library, but I get compiler errors because most of the NEON-based functions lack return statements.
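For reference, the routine I mean is the widely circulated Quake III implementation, which looks roughly like this (canonical form; assumes a 32-bit long, as on 32-bit iOS):

float Q_rsqrt(float number) {
    long i;
    float x2, y;

    x2 = number * 0.5f;
    y  = number;
    i  = *(long *)&y;                 // reinterpret the float's bits as an integer
    i  = 0x5f3759df - (i >> 1);       // magic-constant initial guess
    y  = *(float *)&i;
    y  = y * (1.5f - (x2 * y * y));   // one Newton-Raphson iteration
    return y;
}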
EDIT: I've suddenly been getting some "unclear question" close votes. Although I think it's quite clear and those who answered obviously understood, maybe some people need it stated explicitly:
How do you use Neon to perform faster calculations? And is it really the fastest method for getting the inverse square root on the iPhone?
EDIT: I did some more formal testing on Neon vs. Quake today, but if anything, I'm even more uncertain about the outcome now:
In-App Testing: (An app that is currently in the app store with its invsqrt method modified)
Quake Method (leading by a marginal increase in average FPS under stressful conditions)
Neon (It was a really close call but it seemed that Quake was slightly faster)
1/sqrtf() (a bit more noticeable difference, 1-3 FPS drop).
"Formal" Testing (An app that devours my Phone's CPU. Times how long it takes each method to get through an array of 10000000 randomly generated floats)
Neon (clearly the fastest, and double the speed if it is used to do two sqrts at once).
1/sqrtf() (Only marginally slower than Neon. This surprising result leads me to deem this test "inconclusive" until I investigate further)
Quake (This method, surprisingly, was a few orders of magnitude slower than the other two methods. This is especially surprising given its performance in the other test.)
While Quake vs. Neon was too close to call in the app performance test, Quake vs. 1/sqrtf() was quite clear-cut in the first test, and the second test was extremely consistent in the values it output. What matters in the end, though, is app performance, so I'm going to make my final decision based on that test.

The accepted answer of the question you've linked already provides the answer, but doesn't spell it out:
#import <arm_neon.h>
void foo(float32x2_t someFloats) {
    // vrsqrte_f32 gives an estimate of 1/sqrt() for both lanes at once
    float32x2_t inverseSqrt = vrsqrte_f32(someFloats);
}
Header and function are already provided by the iOS SDK.
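For a bit more accuracy, the vrsqrte_f32 estimate can be refined with one Newton-Raphson step using vrsqrts_f32. A minimal sketch, processing two floats per call (the function name is illustrative):

#include <arm_neon.h>

// Approximate 1/sqrt(x) for two floats at once, then refine once.
// vrsqrts_f32(a, b) computes (3 - a*b) / 2, the Newton-Raphson step factor.
float32x2_t inv_sqrt2(float32x2_t v) {
    float32x2_t x = vrsqrte_f32(v);                   // initial estimate
    x = vmul_f32(x, vrsqrts_f32(vmul_f32(v, x), x));  // one refinement step
    return x;
}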

https://code.google.com/p/math-neon/source/browse/trunk/math_sqrtf.c <- there's a NEON implementation of invsqrt there; you should be able to copy the assembly bit as-is.

Related

CAN Bus Communication: Same Bit Rate, Different Time Segments

I would like to preface that I am new to CAN, so I apologize if this is an obvious question.
I am using an STM32 microprocessor that has CAN communication, whose bit rate I have set to 500 kbit/s. I am trying to communicate with another node (whose source code I do not have access to) and their bit rate is the same (500 kbit/s). I am wondering: if they're using the same bit rate but different bit time parameters (Prescaler, SyncJumpWidth, TimeSeg1, TimeSeg2), will they still be able to communicate with each other?
Possibly, but it isn't certain. Propagation delay caused by long wires, for example, will make it more or less critical where the sample point is located. Stubs or poorly terminated buses may suffer from signal reflections that make the sample point location more or less critical. A higher baud rate leads to more bit-length inaccuracy, as does a poor clock source. And so on. Also, if you go with strange and exotic settings, nothing tends to work and you just get error frames.
I have had good experience following the requirements of the CANopen standard, which is to place the sample point as close to 87.5% as possible. The easiest way to achieve that is with a total of 16 tq where phase seg 2 is 2 tq long. 16 also tends to work well with most prescaler clocks. Note that there's a hard requirement in the CAN standard not to use more than 25 tq.
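For example, on an STM32 with the HAL drivers, a 500 kbit/s configuration with a 16 tq bit time and an 87.5% sample point could look roughly like the sketch below. The 8 MHz CAN peripheral clock, the header name, and the function name are assumptions; adjust the prescaler to your actual clock.

#include "stm32f1xx_hal.h"   /* assumed; adjust to your MCU family */

/* Assumed: 8 MHz CAN clock / (prescaler 1 * 16 tq) = 500 kbit/s.
   Sample point = (1 + 13) / 16 = 87.5%. */
void can_config_500k(CAN_HandleTypeDef *hcan)
{
    hcan->Instance           = CAN1;
    hcan->Init.Prescaler     = 1;
    hcan->Init.Mode          = CAN_MODE_NORMAL;
    hcan->Init.SyncJumpWidth = CAN_SJW_1TQ;
    hcan->Init.TimeSeg1      = CAN_BS1_13TQ;  /* prop seg + phase seg 1 */
    hcan->Init.TimeSeg2      = CAN_BS2_2TQ;   /* phase seg 2 */
    /* remaining Init fields (bus-off handling, FIFO options, ...) left to your setup */
    HAL_CAN_Init(hcan);
}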

Hardware/Software rasterizer vs Ray-tracing

I saw the presentation at High-Performance Graphics, "High-Performance Software Rasterization on GPUs", and I was very impressed by the work/analysis/comparison.
http://www.highperformancegraphics.org/previous/www_2011/media/Papers/HPG2011_Papers_Laine.pdf
http://research.nvidia.com/sites/default/files/publications/laine2011hpg_paper.pdf
My background was CUDA; then I started learning OpenGL two years ago to develop the 3D interface of EMM-Check, a field-of-view analysis program to check whether a vehicle fulfills a specific standard or not. Essentially you load a vehicle (or different parts), then you can move it completely or separately, add mirrors/cameras, analyze the point of view and shadows from the driver's point of view, etc.
We are dealing with some transparent elements (mainly the fields of view, but the vehicles themselves might be transparent too), therefore I wrote a rough algorithm to sort the elements to be rendered on the fly (at primitive level, a kind of painter's algorithm), but of course there are cases in which it easily fails, although for most cases it is enough.
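A simplified sketch of that kind of per-primitive back-to-front sort (illustrative only, not the actual EMM-Check code; the struct and field names are made up):

#include <stdlib.h>

typedef struct {
    float centroid_z;   /* view-space depth of the triangle's centroid */
    /* ... vertex/material data ... */
} Triangle;

/* Sort back-to-front so transparent triangles blend in a plausible order.
   With the camera looking down -z, more negative z means farther away.
   This breaks down when triangles intersect or overlap cyclically. */
static int cmp_depth(const void *a, const void *b)
{
    float za = ((const Triangle *)a)->centroid_z;
    float zb = ((const Triangle *)b)->centroid_z;
    return (za < zb) ? -1 : (za > zb);   /* ascending z: farthest first */
}

void sort_back_to_front(Triangle *tris, size_t count)
{
    qsort(tris, count, sizeof(Triangle), cmp_depth);
}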
For this reason I started googling and found many techniques, like (dual) depth peeling, A/R/K/F-buffers, etc.
But it looks like all of them suffer at high resolution and/or with a large number of triangles.
Since we also deal with millions of triangles (up to 10 million, more or less), I was looking for something else, and I ended up at software renderers: compared to the hardware ones, they offer free programmability but are slower.
So I wonder if it might be possible to implement something hybrid, that is, using the hardware renderer for the opaque elements and the software one (CUDA/OpenCL) for the transparent elements, and then combining the two results.
Or maybe a simple ray-tracing algorithm in CUDA/OpenCL (no complex visual effects required, just position, color, simple lighting and proper transparency) might be much simpler from this point of view and also give us a lot of freedom/flexibility in the future?
I did not find anything on the net regarding this... is there perhaps a particular obstacle?
I would like to hear every thought/tip/idea/suggestion that you have regarding this.
PS: I also found "Single Pass Depth Peeling via CUDA Rasterizer" by Liu, but the solution from the first paper seems far faster.
http://webstaff.itn.liu.se/~jonun/web/teaching/2009-TNCG13/Siggraph09/content/talks/062-liu.pdf
I might suggest that you look at OpenRL, which will give you hardware-accelerated ray tracing.

What steps should be taken to make sure that OpenCV code running on a PC will run on a particular embedded device?

I want to port a good OpenCV codebase to an embedded platform. Earlier such things were very difficult to do, but now TI has come up with nice embedded platforms which are, as they say, comparatively hassle-free.
I want to know the following things:
Given that:
The OpenCV code is already running on PC smoothly. (obviously)
Need to determine these before purchasing the device.
Can't put the code here in stackoverflow. :P
To choose from Texas Instruments: C6000.
Questions:
How to make sure that the porting can be done?
What steps should be taken to make sure that, after porting, the code will run (at least)?
How to determine whether the code might require some changes to make it run smoothly?
Point 3 above is optional.
I need info which will at least give me some start up in this regard.
What I thought I should do:
List the built-in functions used.
Then find available online benchmarking for those functions on the particular device, as shown towards the end of this doc.
...
I need to know how to proceed further.
However, the C6-Integra™ DSP+ARM processor seems the best.
The best you can do is to try a device simulator (if it is available), but what you'll see there is far from perfect.
Actually, nothing can tell you how fast and how well the app will run on the embedded device before running your specific app on that specific device.
So:
Step 1 Buy it
Step 2 Try it
Things to consider:
Embedded CPU architecture: does your app need a big cache? How big is the embedded cache?
Algorithm: do you use a lot of floating-point operations? How good is the device at floating-point ops?
Do you have a lot of memory transfers? The data bus on a PC is way faster than on an embedded device.
Hardware support: do you use a lot of double-precision calculations? They are emulated on ARMs and will kill your app (from milliseconds on a PC it can go to seconds on an ARM).
Acceleration: do your functions use SSE? (Many OpenCV functions are SSE-optimized, even if you don't know it.) Do you have the NEON counterpart? (OpenCV does not have much support for that.) The difference can be orders of magnitude from x86 SSE to embedded without NEON.
and many, many others.
So, again: no one can tell you how it will work. Just the combination between the specific app and the real device tells the truth.
Even a run on a similar device is not conclusive: the app can run smoothly on a given processor, and on another with similar frequency or listed memory it can slow down too much.
This is an interesting question, but "run" is a very generic word in this context, so I feel the need to break it down into two other questions:
Will it compile in an embedded device?
Will it run as fast/smooth as in a PC?
I've used OpenCV on a lot of different devices, including ARM, SH4 and MIPS, and I found out that sometimes the manufacturer of the device itself provides a compiled version of OpenCV (to my surprise), which is great. That's something you can look into; maybe the manufacturer of your device provides OpenCV binaries.
There's no way to know for sure how smooth your OpenCV application will be on the target device unless you are able to find some benchmark of OpenCV running on it. PCs have far better processing power than embedded devices, so you can expect less performance from the target device.
There are third-party applications like opencv-performance that you can use to test/benchmark the environment once you get your hands on it. And if performance is such a big deal in this project, you might also be interested in this nice article, which explains some timing tests done on a couple of OpenCV features, comparing implementations using the C and C++ interfaces of OpenCV.
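If you do end up benchmarking by hand, a minimal timing harness is usually enough to compare one function across the PC and the target. A sketch assuming a version of OpenCV that still ships the legacy C interface (the Gaussian smooth is just an arbitrary example workload):

#include <opencv/cv.h>
#include <opencv/highgui.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
    IplImage *src = cvLoadImage(argv[1], CV_LOAD_IMAGE_GRAYSCALE);
    IplImage *dst = cvCloneImage(src);
    int runs = 100;

    clock_t t0 = clock();
    for (int i = 0; i < runs; i++)
        cvSmooth(src, dst, CV_GAUSSIAN, 5, 5, 0, 0);   /* example workload */
    clock_t t1 = clock();

    printf("avg per call: %.3f ms\n",
           1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC / runs);

    cvReleaseImage(&src);
    cvReleaseImage(&dst);
    return 0;
}

Build and run the same source on both machines and compare the numbers; CPU time via clock() is crude but adequate for a first comparison.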

iOS / C: Algorithm to detect phonemes

I am searching for an algorithm to determine whether realtime audio input matches one of 144 given (and comfortably distinct) phoneme-pairs.
Preferably the lowest level that does the job.
I'm developing radical / experimental musical training software for iPhone / iPad.
My musical system comprises 12 consonant phonemes and 12 vowel phonemes, demonstrated here. That makes 144 possible phoneme pairs. The student has to sing the correct phoneme pair 'laa duu bee' etc in response to visual stimulus.
I have done a lot of research into this, and it looks like my best bet may be to use one of the iOS Sphinx wrappers (iPhone App › Add voice recognition? is the best source of information I have found). However, I can't see how I would adapt such a package; can anyone with experience using one of these technologies give a basic rundown of the steps that would be required?
Would training be necessary by the user? I would have thought not, as it is such an elementary task, compared with full language models of thousands of words and far greater and more subtle phoneme base. However, it would be acceptable (not ideal) to have the user train 12 phoneme pairs: { consonant1+vowel1, consonant2+vowel2, ..., consonant12+vowel12 }. The full 144 would be too burdensome.
Is there a simpler approach? I feel like using a fully featured continuous speech recogniser is using a sledgehammer to crack a nut. It would be far more elegant to use the minimum technology that would solve the problem.
So really I'm hunting for any open source software that recognises phonemes.
PS: I need a solution which runs pretty much in real time, so even as they are singing the note, it first blinks on to illustrate that it picked up the phoneme pair that was sung, and then it glows to illustrate whether they are singing the correct pitch.
If you are looking for a phone-level open source recogniser, then I would recommend HTK. Very good documentation is available with this tool in the form of the HTK Book. It also contains an entire chapter dedicated to building a phone level real-time speech recogniser. From your problem statement above, it seems to me like you might be able to re-work that example into your own solution. Possible pitfalls:
Since you want to build a phone-level recogniser, the amount of data needed to train the phone models would be very large. Also, your training database should be balanced in terms of the distribution of the phones.
Building a speaker-independent system would require data from more than one speaker. And lots of that too.
Since this is open source, you should also check the licensing info for any additional details about shipping the code. A good alternative would be to use the on-phone recorder and then have the recorded waveform sent over a data channel to a server for the recognition, pretty much like what Google does.
I have a little bit of experience with this type of signal processing, and I would say that this is probably not the type of finite question that can be answered definitively.
One thing worth noting is that although you may restrict the phonemes you are interested in, the possibility space remains the same (i.e. infinite-ish). User training might help the algorithms along a bit, but useful training takes quite a bit of time and it seems you are averse to too much of that.
Using Sphinx is probably a great start on this problem. I haven't gotten very far in the library myself, but my guess is that you'll be working with its source code yourself to get exactly what you want. (Hooray for open source!)
...using a sledgehammer to crack a nut.
I wouldn't label your problem a nut, I'd say it's more like a beast. It may be a different beast than natural language speech recognition, but it is still a beast.
All the best with your problem solving.
Not sure if this would help: check out OpenEars' LanguageModelGenerator. OpenEars uses Sphinx and other libraries.
http://www.hfink.eu/matchbox
This page links to both YouTube video demo and github source.
I'm guessing it would still be a lot of work to mould it into the shape I'm after, but it also definitely does do a lot of the work.

DirectX 9 or DirectX 10 for starters?

I want to do projects to make my resume more appealing to game companies, so I am going to start buying books. But I don't know whether to read DirectX 9 or DirectX 10 API books to start off with. DirectX 10 is great, but it seems the industry is moving slowly to 10. So should I use 9 or go with 10?
I would suggest learning the basics using DirectX 9 and then rapidly moving on to DX11. DirectX 11 is harder to get started in than DirectX 9 because it's slightly more complex, but also a lot of the utility functions in D3DX are no longer there, or have been moved to source code like the effects framework. This is no bad thing, but it does make it significantly more complex to learn, as you have to learn a lot more things at once.
Spend 2 or 3 weeks learning DX9 then move to DX11 for "real" work :P
Learn basic DX9 using the fixed pipeline and D3DX for loading models etc. It's a lot simpler than DX11 and much better documented, and you'll get a triangle and then a model on screen very much faster. Play with that until you completely understand the basic concepts and transformations.
But then rewrite it all using shaders only. You'll need to use them in DX10/11 anyway but it's a lot easier to learn when you already have a working framework of code, and it's a lot simpler to get that working in DX9.
Once you have that working, learn DX11. You'll have to switch math libraries. You'll have to invent your own model formats and loaders. You'll have to either invent your own effects framework or use the example one, but they are all much easier now you already know the basics of 3d and programming shaders.
TBH, further to OneOfOne's comment, if you know how to do 3D development in GL, D3D9, D3D10 or D3D11, then you can transfer those skills to any of the others with a little bit of work.
Personally I'd aim for D3D11, as that way you are learning the cutting edge. You'll find you'll be able to do GL, D3D9 or D3D10 with a little work. Do enough work on the theory and you'll discover that it's not even that hard to transfer the skills to a fully software engine.
If your intention is really to learn a skill that you would use in the game industry, stick with DirectX 9. Since DirectX 10 and 11 both require Vista or Windows 7, game developers are still mostly ignoring them and targeting DirectX 9 in order to have support for Windows XP.
That being said, it doesn't really matter which you start with. The differences are not that large. If you understand the concepts behind 3D APIs and how the GPU pipeline works, you can pick up any of the three or even OpenGL with minimal effort.
Fact is, you need to learn both.
As long as 50% of gamers are still on WinXP, you're going to need to be able to program in Direct3D9.
D3D9 isn't any easier to get started with than D3D10/11. It's the same principles, with vertices to be placed, normals to be calculated, and meshes to be rendered. Whether you're creating an ID3D11BlendState structure or calling IDirect3DDevice9::SetRenderState(), it's the same concept, just different ways of doing it.
After working with D3D11 for a couple of days, I've come to think of it as better than DX9 in a lot of ways. For one, you're able to use the full capabilities of the GPU, including geometry shaders. Second, it forces you to fully understand the graphics pipeline to draw anything at all (note how functions are named after the stage of the pipeline they affect: IA* functions for the input-assembler stage, OM* functions for the output-merger stage, etc.). This may result in a slightly larger initial learning curve, but once you get it, it's not any harder than D3D9 and is better, since the very naming of the functions helps the concepts stick.
So get going on both; learning them in tandem may help reduce the amount of effort you spend learning deprecated APIs/methods of doing things from DX9 (i.e. you really want to spend more time using shaders, and shouldn't use the fixed-function pipeline section of DX9 too much).
You can check Luna's books for DX9/DX11 (I suggest you start with 11). You can also check out http://www.rastertek.com/tutdx11.html, but he doesn't explain everything, so you can go into Luna's book to see what those functions or properties are about.
With a few small exceptions, DX10 is just a legacy-free DX9. For example, DX9 had built-in options for rendering flat-shaded, textured, or using a shader. In DX10 these options are gone; you always have to use a real shader. If you want to do flat shading, write an HLSL shader that does flat shading.
So I would suggest you learn DX10 (or DX11). You will be able to adapt quickly to DX9, but with a more modern coding style, by not using legacy functions. They can be quite confusing, so DX10 will keep you focused on the relevant things.
If you are a real beginner, and setting up a vertex buffer to create a single triangle confuses you (as a real 3D programmer you are no longer interested in single triangles), I would even suggest starting with OpenGL. You will have faster success, but in reality this can be a little bit distracting, just like DX9 legacy code, if you want to focus on modern 3D coding.
Yes, do not waste your time with DX10; it was never really adopted as the industry standard for any period of time. There weren't any big enough changes to warrant people upgrading from DX9, but for DX11 there were.
I suggest DirectX 11; there's no reason, in my opinion, to waste time on deprecated functions or techniques.
Learning shaders from the start will make things way clearer.
Try doing the samples from the sample folder of both 9 and 10, and if your computer can support it, 11. This is what I am doing.
