How to find the offset between two audio files? One is noisy and one is clear - iOS

I have a scenario in which the user records a concert on video, with the real-time audio of the performer, while at the same time the device downloads the live stream from the audio broadcaster's device. Later I replace the noisy real-time audio (captured while recording) with the good-quality audio that I streamed and saved on my phone. Right now I set the audio offset manually, by trial and error, while merging, so that the audio and video line up at the exact position.
Now I want to automate the audio synchronisation. Instead of merging the video with the clear audio at a given offset, I want to merge them automatically with proper sync.
For that I need to find the offset at which I should replace the noisy audio with the clear audio. E.g. when the user starts and stops the recording, I will take that sample of real-time audio, compare it with the live-streamed audio, take the exact matching part of the streamed audio, and sync it at the right time.
Does anyone have an idea how to find the offset by comparing two audio files, and sync with the video?

Here's a concise, clear answer.
• It's not easy - it will involve signal processing and math.
• A quick Google gives me this solution, code included.
• There is more info on the above technique here.
• I'd suggest gaining at least a basic understanding before you try and port this to iOS.
• I would suggest you use the Accelerate framework on iOS for fast Fourier transforms etc.
• I don't agree with the other answer about doing it on a server - devices are plenty powerful these days. A user wouldn't mind a few seconds of processing for something seemingly magic to happen.
Edit
As an aside, I think it's worth taking a step back for a second. While
math and fancy signal processing like this can give great results, and
do some pretty magical stuff, there can be outlying cases where the
algorithm falls apart (hopefully not often).
What if, instead of getting complicated with signal processing,
there's another way? After some thought, there might be. If you meet
all the following conditions:
• You are in control of the server component (audio broadcaster
device)
• The broadcaster is aware of the 'real audio' recording
latency
• The broadcaster and receiver are communicating in a way
that allows accurate time synchronisation
...then the task of calculating audio offset becomes reasonably
trivial. You could use NTP or some other more accurate time
synchronisation method so that there is a global point of reference
for time. Then, it is as simple as calculating the difference between
audio stream time codes, where the time codes are based on the global
reference time.

This could prove to be a difficult problem: even though the signals are of the same event, the presence of noise makes a comparison harder. You could consider running some post-processing to reduce the noise, but noise reduction in itself is an extensive, non-trivial topic.
Another problem could be that the signals captured by the two devices actually differ a lot. For example, the good-quality audio (I guess output from the live mix console?) will be fairly different from the live version (which I guess is coming out of the on-stage monitors / FOH system and captured by a phone mic?).
Perhaps the simplest possible approach to start would be to use cross correlation to do the time delay analysis.
A peak in the cross correlation function would suggest the relative time delay (in samples) between the two signals, so you can apply the shift accordingly.
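As a rough illustration of the cross-correlation idea, here is a minimal pure-Python sketch. It is not iOS code, and the function name `estimate_offset` and the synthetic test signals are made up for the example. It slides the noisy segment along the clean recording and keeps the lag with the highest normalised correlation. The brute-force loop is only practical for short signals; on a device you would more likely compute the same thing with FFTs.

```python
import math
import random

def estimate_offset(reference, segment):
    """Return the lag (in samples) where `segment` best matches `reference`.

    Brute-force normalised cross-correlation, O(len(reference) * len(segment)).
    """
    seg_energy = sum(s * s for s in segment)
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(reference) - len(segment) + 1):
        window = reference[lag:lag + len(segment)]
        win_energy = sum(w * w for w in window)
        if win_energy == 0 or seg_energy == 0:
            continue  # silent window: correlation is undefined, skip it
        dot = sum(w * s for w, s in zip(window, segment))
        score = dot / math.sqrt(win_energy * seg_energy)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Synthetic stand-ins for the two recordings (assumption: same sample rate).
rng = random.Random(0)
clean = [rng.gauss(0, 1) for _ in range(1000)]                    # streamed audio
noisy = [clean[300 + i] + rng.gauss(0, 0.5) for i in range(200)]  # phone capture

offset_samples = estimate_offset(clean, noisy)
offset_seconds = offset_samples / 44100  # assumed sample rate of 44.1 kHz
```

Dividing the lag by the sample rate gives the offset in seconds, which is the value you would pass to the merge step. Both files need to be resampled to the same rate before comparing them.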

I don't know a lot about the subject, but I think you are looking for "audio fingerprinting". Similar question here.
An alternative (and more error-prone) way is to run both sounds through a speech-to-text library (or an API) and match the relevant part. This would of course not be very reliable: sentences frequently repeat in songs, and the concert may be instrumental.
Also, doing audio processing on a mobile device may not work well (because of low performance, high battery drain, or both). I suggest you use a server if you go that way.
Good luck.

Related

How many sounds can be played at a time on iOS - AVAudioPlayer vs. AVAudioEngine & AVAudioPlayerNode

I have an application in which there is a set of about 50 sounds, which range in length from about 300 ms to about 4 seconds. Various combinations of sounds need to be played at precise times (up to 10 of them can be triggered at once). Some sounds need to be repeated at intervals as short as 100 ms.
I've implemented this as a two-dimensional array of AVAudioPlayers, all of which are loaded with sounds at application launch. There are several players for each sound, to accommodate rapidly repeating sounds. The players for a particular sound are reused in strict rotation. When a new sound is scheduled, the oldest player for that sound is stopped and its current time is set to 0, so the sound will repeat from the start the next time it's scheduled using player.play(atTime:). There's a thread that schedules new sets of sounds about 300 ms before they are to be played.
It all works quite nicely, up to a point that varies with the device. Eventually, as sounds are played more rapidly, and/or more simultaneous sounds are scheduled, some sounds will refuse to play.
I'm contemplating switching to AVAudioEngine and AVAudioPlayerNodes, using a mixer node. Does anyone know if that approach is likely to handle more simultaneous sounds? My guess is that both approaches translate into a rather similar set of CoreAudio functions, but I haven't actually written the code to test that hypothesis - before I do that, I'm hoping that someone else may have explored this issue before me. I've been deep into CoreAudio before, and I'm hoping to be able to use these handy high-level functions instead!
Also, does anyone know of a way to trigger a closure when a sound initiates? The documented functionality allows for a callback closure, but the only way I've been able to trigger events when the sounds start, is to create a high quality of service queue for DispatchQueue. Unfortunately, depending on the system load, queued events may be executed at times that vary from the scheduled times by up to about 50 ms, which is not quite as precise as I'd prefer to be.
Using AVAudioEngine with AVAudioPlayerNodes provides much better performance, albeit at the cost of a bit of code complexity. I was able to easily increase the playback rate by a factor of five, with better buffer control.
The main drawback in switching to this approach was that Apple's documentation is less than stellar. A few additions to Apple's documentation would have made this task a LOT easier:
Mixer nodes are documented as being able to convert sample rates and channel counts, so I attempted to configure audioEngine.mainMixerNode to convert mono buffers to the output node's settings. Setting the main mixer node's output to the output node's format appeared to be accepted, but threw opaque errors at run time that complained about channel count mismatches.
It appears that the main mixer node is not actually a fully functional mixer node. To get this to work, I had to insert another mixer node that performed the channel conversion, and connect it to the main mixer node. If Apple's documentation had actually mentioned this, it would have saved me a lot of experimentation.
Also, just scheduling a buffer does not cause anything to play. You need to call play() on the player node before anything will happen. Apple's documentation is confusing here - it says that calling play() with no arguments will cause playback to occur immediately, which wasn't what I wanted. It took some experimentation to determine that play() just tells the player node to wake up, and that scheduled buffers will actually be played at the scheduled time, rather than immediately.
It would have been enormously helpful if Apple had provided more than the auto-generated class documentation. A bit of human-generated documentation would have saved me an awful lot of frustrating experimentation.
Chris Adamson's well-written "Learning Core Audio" was very helpful when I was working with Core Audio - it's a shame that the newer AVAudioEngine functionality isn't documented nearly as well.

Establishing synchronized music streaming across devices

I am attempting to stream audio files from a server to iOS devices and play them completely synchronized. For example on my phone I might be 20 secs into a song and then my friend next to me should also be 20 secs into the song as well. I know this is not an easy problem to solve, but I am attempting to do so.
I can currently get them within one second of each other by calculating the difference in time between the devices and then have them sync up, however that is not good enough because the human ear can detect a major difference in a second and this is over WIFI.
My next approach is going to be to unicast the one file from the server, have all the devices pick it up directly from the server, and then implement some type of buffer system similar to Netflix so that network connectivity would be a limiting factor. http://www.wowza.com/ is what I would use to help with that.
I know this can be done, because http://lysn.in/ does it with their app and I want to be able to do something similar.
Any other recommendations after I try my unicast option?
Would implementing firebase help solve a lot of the heavy lifting problems?
(1) In answer to ONE of your questions (the final one):
Firebase is not "realtime" in "that sense" -- PubNub is probably (almost certainly) the fastest "realtime" messaging for and between apps/browser/etc.
But they don't mean real-time in the sense of real-time, say, as race game engineers mean it or indeed in your use-case.
So firebase is not relevant to you here and won't help.
(2) Regarding your second general question: "how to synchronise time on two or more devices, given that we have communications delays."
Now, this is a really well-travelled problem in computer science.
It would be pointless outlining it here, because it is fully explained here http://www.ntp.org/ntpfaq/NTP-s-algo.htm if you click on "How is time synchronised"?
So in fact, to get a good time base on both machines, you should use that! Have both machines really accurately set a time to NTP using the existing (perfected for decades) NTP synchronisation.
(So for example https://stackoverflow.com/a/6744978/294884 )
In fact are you doing this?
It's possible that doing that will solve all your problems; then just agree to start at a certain exact time.
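For a feel of what the NTP exchange computes, here is a small sketch of the standard offset and round-trip delay calculation from the four timestamps of one request/response. The function name and the example numbers are illustrative only. It assumes the network delay is roughly the same in both directions, which is the same assumption NTP itself makes.

```python
def clock_offset_and_delay(t0, t1, t2, t3):
    """Estimate how far the client clock is from the server clock.

    t0: client clock when the request was sent
    t1: server clock when the request arrived
    t2: server clock when the reply was sent
    t3: client clock when the reply arrived
    Returns (offset, round_trip_delay) in seconds; offset > 0 means the
    client clock is behind the server clock.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay

# Illustrative numbers: client is 250 ms behind, 40 ms each way on the
# network, 10 ms of processing on the server.
offset, delay = clock_offset_and_delay(100.000, 100.290, 100.300, 100.090)
# offset is about 0.250 s and delay is about 0.080 s

# Every device converts the agreed server start time into its own clock:
server_start_time = 105.000
local_start_time = server_start_time - offset
```

In practice you would repeat the exchange several times and prefer the samples with the smallest round-trip delay, since those are the least distorted by queuing.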
Hope it helps!
I would recommend against using the data movement to synchronize the playback. This should be straightforward to do with a buffer and a periodic "sync" signal that is sent at a period of < 1/2 the buffer size. Worst case this should generate a small blip on devices that get ahead or behind relative to the sync signal.

What is the best method of synchronizing audio across iOS devices with WiFi?

Basically, for my team's app, we need to be able to synchronize music across multiple iOS devices. The first way we did this was by having the music on all the devices already and just sending a play command to all the devices. Some would get it later than others, so that method did not work. There was an idea mentioned to calculate the latency between all the devices and send the commands at the appropriate times based on the latency.
The second way proposed would be to stream the music. If we were to implement streaming, how should we go about doing it. Should Audio Units be used, OpenAL, etc.? Also, if streaming was being done, how would we go about making sure that each device's stream was in sync.
Basically, the audio has to be in sync so that the person hearing it cannot differentiate between the devices. A few milliseconds off should not be a problem (unless the listener has super-human hearing).
You'd be amazed at how good the human ear is at spotting audio anomalies...
Sync the time of day
Effectively you're trying to meet a real-time requirement with a whole load of very variable things in the way (wifi, etc). I strongly suspect the only way you're going to get close to doing this is to issue a 'play' instruction that includes a particular time to start playing. Of course that relies on all the clocks being accurately set.
NTP
I don't know how iPhones get their time of day. If they use (or could use) NTP then you'll be getting close. NTP is designed to convey accurate time of day information over a network despite variable network delays. I've had a quick look and it seems that most NTP clients for iOS are the simple ones, not the full NTP that measures and tunes out network delays, clock drifts, etc.
GPS
Alternatively GPS is also a very good source of time information. Again I don't know if iPhones can or do use GPS for setting their clock but if it could be done then that would likely be pretty good. On Solaris (and I think Linux too) the 1 pulse per second that most GPS chips generate from the GPS signal can be used to condition the internal OS clock, making it very accurate indeed (sub microsecond accuracy).
I fear that iPhones don't do either of these things natively; both involve using a fair bit of electricity, so I wouldn't be surprised if they did something else less sophisticated.
Cell Time Service
Some Cell networks provide a time service too, but I don't think it's designed for accurate time setting. Also it tends not to be available everywhere. You often find it at major airports so that recent arrivals get their phones set to something close to local time.
Play at time X
So if one of those could be used to ensure that all the iPhones are set to exactly the same time of day then all you have to do is write your software to start playing at a specific time. That will probably involve polling the clock in a very tight loop waiting for it to tick over; most OSes don't provide a means of sleeping until a specific time. They do at least allow for sleeping for a period of time, which can be used to sleep until close to the appointed time. You'd then start polling the clock until the right time is reached.
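A minimal sketch of that "sleep until close, then poll" pattern, written in Python purely for illustration (the helper name `wait_until` and the 5 ms spin window are assumptions, not measured values):

```python
import time

def wait_until(target, spin_window=0.005):
    """Block until the wall clock reaches `target` (seconds since the epoch).

    Sleeps for most of the wait, then polls the clock in a tight loop for
    the last `spin_window` seconds. Returns the clock value observed when
    the target was reached.
    """
    remaining = target - time.time()
    if remaining > spin_window:
        time.sleep(remaining - spin_window)  # coarse wait, cheap on CPU
    while True:
        now = time.time()                    # fine wait: busy-poll the clock
        if now >= target:
            return now

start_at = time.time() + 0.05   # agreed start time, 50 ms from now
fired_at = wait_until(start_at)
lateness = fired_at - start_at  # how late we fired, in seconds
```

The spin window trades CPU and battery for precision. If the OS overshoots the sleep by more than the window, the function simply returns late, so `lateness` is worth logging while tuning.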
Delay Measurement and Standard Deviation
Your first method is doomed I think. You might be able to measure average delays and so forth but that doesn't mean that every message has exactly the same latency. The standard deviation in the latency will tell you what you can expect to achieve, and I don't think that's going to be particularly small. If so then the message has got to include a timestamp.
NTP can work because it's only interested in the average delay measured over a period of time (hours sometimes), whereas you're interested in instantaneous delay.
Streaming with RTP
Your second method may work if you can time sync the devices as discussed above. The RTP protocol was designed for use in these circumstances; it doesn't help with achieving sync, but it does help a lot with the streaming. It tells you whereabouts in the stream any one piece of received data fits, allowing you to play it at the right time.
Clock Drift
Another problem to deal with is how long you're playing for. If it's a long time then you may discover that the 44kHz (or whatever) audio clock rate on each device isn't quite the same. So, whilst you might find a way of starting to play all at the same time, the separate devices will then start diverging ever so slightly. Over a long period of time they may be noticeably out.
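To put a number on that, here is a small worked example. The 50 parts-per-million clock error is an assumed figure chosen for illustration, not a measured value for any particular device.

```python
def drift_seconds(nominal_rate_hz, actual_rate_hz, elapsed_seconds):
    """Playback position error after `elapsed_seconds` of real time.

    Positive means the device has played further into the track than a
    device running at exactly the nominal rate.
    """
    samples_played = actual_rate_hz * elapsed_seconds
    return samples_played / nominal_rate_hz - elapsed_seconds

nominal = 44100.0
fast_device = nominal * (1 + 50e-6)  # assumed: audio clock runs 50 ppm fast

drift_after_10_min = drift_seconds(nominal, fast_device, 600.0)
# about 0.03 s: two such devices drifting in opposite directions would be
# roughly 60 ms apart after ten minutes
```

So even with a perfect start, long tracks need an occasional resync or a small playback-rate correction.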
BlueTooth
It might be possible to do something with BlueTooth. It has many weird and wonderful profiles, and it might be that one of those would serve to send an accurate 'start now' message.
Audio Trigger
You might also use sound as a means of conveying a start signal. One device can play a particular sound whilst your software in the others is listening with the mic. When a particular feature is detected in the sound, that's the time for everyone to start playing. Sort of a computerised "1, 2, a 1 2 3 4".
Camera Flash
Should be easy to spot in software...
I think your first way would work if you expand it a little bit. Assuming all the clocks on the devices are in sync you could include a timestamp in your play command. Then each device would calculate the time between the timestamp and when it received the command. You would then play the music and offset it by the time difference.

How to synchronize audio playback on 2 or more iOS devices?

I would like to write a web application that allows me to sync audio playback of an MP3 down to ~50ms, or close enough that the human ear can't detect the difference.
The idea would be that two or more smartphones could each be paired to a bluetooth speaker, and two or more speakers would play the same audio at the exact same time.
How would you suggest I go about setting this up, both client-side and server-side? I'm planning to use Rails/Ruby for backend, and iOS/obj c for mobile dev.
I had thought of the idea of syncing to a global/atomic clock on the server, and having the server provide instructions to clients on when to start playing/jump in to an already playing track. My concern is that, if I want to stream the audio, it will be impossible to load a song into memory and start playback accurately on the millisecond level.
Thoughts?
The jitter in internet packet delivery will be too large, so forget about syncing over the internet. However, you could check the accuracy of NTP, which is still used (I guess; I know that older UNIXes used it) by the OS when you switch on automatic date/time in Settings, but my guess is that it won't be good enough either. Perhaps the OS also uses other time sources like GPS; I don't know how iOS does it, but accuracy within 20 ms is not to be expected. You could create an experimental app to check it out.
So, what's left is a sync closer to home, meaning between the devices directly. Of course you need to make sure that all devices have loaded (enough of) the song, and have preloaded it in AVAudioPlayer or whatever you're using, to be able to start playing immediately. (It may actually not be the best idea to use higher-level `AVAudioPlayer` APIs, as they may give higher delays and, what's more important, higher jitter than lower-level APIs.)
Here are three ideas (one device needs to be master triggering the start play, the others are slaves that are waiting for the trigger):
Use an audio trigger pulse, like a high tone of a defined length and frequency. Then use FFT to recognise this tone.
Connect the devices via GameKit Bluetooth and transmit the trigger on these connections.
Use the iPhone 4+ flash as trigger: flash in a certain pattern. This would require you to sample the video data which is quite doable and can be very fast.
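For the first idea (the audio trigger), a full FFT is not strictly needed when you only care about one known frequency. The sketch below uses the Goertzel algorithm instead, which evaluates a single DFT bin. That is a substitution on my part, not something the answer above prescribes. The function names, the 2 kHz trigger tone, the 8 kHz sample rate and the 0.5 threshold are all assumptions for the example.

```python
import math

def tone_power(samples, sample_rate, frequency):
    """Goertzel algorithm: squared magnitude of the DFT bin nearest `frequency`."""
    n = len(samples)
    k = round(n * frequency / sample_rate)       # nearest bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2

def has_tone(samples, sample_rate, frequency, threshold=0.5):
    """True if at least `threshold` of the frame's energy sits in that bin."""
    energy = sum(x * x for x in samples)
    if energy == 0:
        return False
    fraction = tone_power(samples, sample_rate, frequency) / (energy * len(samples) / 2)
    return fraction >= threshold

rate, n = 8000, 400
trigger = [math.sin(2 * math.pi * 2000 * i / rate) for i in range(n)]  # trigger tone
other = [math.sin(2 * math.pi * 500 * i / rate) for i in range(n)]     # unrelated sound

trigger_heard = has_tone(trigger, rate, 2000)
false_alarm = has_tone(other, rate, 2000)
```

Running this per short frame of microphone input gives a start time that is accurate to about one frame length, so shorter frames give tighter sync at the cost of coarser frequency resolution.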
I'm going with a solution that uses an atomic clock for synchronization, and an external service that allows server instructions/messages to be sent to all devices in close sync.

Active noise cancellation

I have programmed a voice recognition program and I am having problems with the mic hearing me over the computer playing music. I need software that can filter out the sound leaving the speakers from the sound entering the mic.
Is there software or a component (for Delphi) that would solve my problem?
You need to capture:
computer output
mic. input
Then you need to find two parameters, which depend on your mic location and the sound system delay: the delay n and the amplification k, such that
Stream1[t+n]*k=Stream2[t]
where t = time. Once you have found these parameters, the resulting stream, containing only the speech picked up by the mic, will be
Stream2[t]-Stream1[t+n]*k=MusicReductionStream[t]
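A rough sketch of how n and k could be found and applied, in Python rather than Delphi and with made-up helper names. It tries every lag in a small range, fits k by least squares at each lag, and keeps the lag that explains the most mic energy. It assumes a single direct path with no echoes, so treat it as a starting point only.

```python
import math
import random

def fit_delay_and_gain(stream1, stream2, max_lag):
    """Find n and k such that stream1[t + n] * k best matches stream2[t]."""
    best_n, best_k, best_score = 0, 0.0, float("-inf")
    for n in range(-max_lag, max_lag + 1):
        pairs = [(stream1[t + n], stream2[t])
                 for t in range(len(stream2)) if 0 <= t + n < len(stream1)]
        energy = sum(a * a for a, _ in pairs)
        if energy == 0:
            continue
        dot = sum(a * b for a, b in pairs)
        score = dot * dot / energy        # mic energy explained at this lag
        if score > best_score:
            best_n, best_k, best_score = n, dot / energy, score
    return best_n, best_k

def subtract_music(stream1, stream2, n, k):
    """Stream2[t] - Stream1[t+n]*k, leaving samples without overlap untouched."""
    return [stream2[t] - k * stream1[t + n] if 0 <= t + n < len(stream1) else stream2[t]
            for t in range(len(stream2))]

# Synthetic test: the mic hears the music 7 samples late at 60% volume,
# plus a short burst of "speech".
rng = random.Random(1)
music = [rng.gauss(0, 1) for _ in range(500)]                       # Stream1
speech = [0.3 * math.sin(0.4 * t) if 100 <= t < 120 else 0.0 for t in range(500)]
mic = [(0.6 * music[t - 7] if t >= 7 else 0.0) + speech[t] for t in range(500)]  # Stream2

n, k = fit_delay_and_gain(music, mic, max_lag=20)
cleaned = subtract_music(music, mic, n, k)
```

Because the mic hears the music after it was played, n comes out negative with this sign convention. A real room adds reflections, so a single delay and gain will only remove part of the music.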
I think you want to do what noise canceling microphones do. These systems use at least one extra microphone to calculate the difference between "surrounding noise" and the noise that is aimed directly at the microphone (the speech it has to register). I don't think you can reliably obtain the same effect with a software-only solution.
A first step would obviously be to turn music down :-)
Check out the AsioVST library.
100% open source Delphi code
Free
Very complete
Active (support for xe2 / x64 is being added for example)
Under Examples\Plugins\Crosstalk Cancellation\ you'll find the source code for a plugin that probably does what you're looking for.
The magic happens in DAV_DspCrosstalkCancellation.pas.
I think the speex pre-processor has an echo-cancellation feature. You'll need to feed it the audio data you recorded, and the audio you want to cancel, and it'll try to remove it.
The main problem is finding out what audio your computer plays. Not sure if there is a good API for that.
It also has a noise reduction feature, and voice activity detection. You can compile it as a dll, and then write a delphi header.
You need to estimate the impulse response of the speaker and room, etc., which can change with exact speaker and mic positioning and the size and contents of the room, etc., as well as knowing/estimating the system delay.
If the person or the mic are moveable, the impulse response and delay will need to be continually re-estimated.
Once you have estimated the impulse response, you can convolve it with the output signal and try to subtract delayed versions of the result from the mic input until you can null silent portions of the speech input. Cross correlation might be useful for estimating the delay.
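One common way to estimate the impulse response continuously and subtract the result in the same pass is a normalised LMS (NLMS) adaptive filter. That is my choice for this sketch, not something the answer above names. The tap count, step size and the toy "room" response are assumptions, and a real echo canceller also has to handle the case where the person is speaking while the filter adapts.

```python
import random

def nlms_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Adaptively estimate the speaker-to-mic impulse response and subtract it.

    reference: samples sent to the speakers
    mic:       samples captured by the microphone
    Returns (residual, weights); weights approximate the impulse response.
    """
    weights = [0.0] * taps
    history = [0.0] * taps                 # recent reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))  # convolution step
        error = d - estimate                                      # mic minus predicted music
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights

# Toy room: one sample of delay followed by two weaker reflections.
room = [0.0, 0.5, 0.3, -0.1]
rng = random.Random(2)
played = [rng.gauss(0, 1) for _ in range(3000)]
heard = [sum(room[j] * played[t - j] for j in range(len(room)) if t - j >= 0)
         for t in range(len(played))]

residual, weights = nlms_cancel(played, heard)
tail_energy = sum(e * e for e in residual[-500:])
```

With no speech present, the residual should shrink towards silence as the weights converge on the room response. With speech present, the residual is what you would feed to the recogniser.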

Resources