Algorithm to detect cut-short audio recordings - machine-learning

I work on software which processes short speech recordings, usually no more than a sentence or two. These recordings are sent to my software as WAV files. Sometimes, due to user error, a recording is cut mid-sentence. (A user starts/stops recording using a button and sometimes mistimes it.) E.g. a user was saying "I want to buy 3 apples", but due to a mistimed release of the recording button it comes out as "I want to buy 3 ap".
I wonder if there is an algorithm, ML or not, that would allow me to filter out such cases. I accept that it won't be 100% accurate, but I want to reduce the number of false positives as much as possible.
Maybe something based on an abrupt end of the voice, rather than the slower dying down of the voice at the end of a sentence?
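To make that concrete, here is a rough sketch of the heuristic I have in mind (assuming 16-bit mono WAV input; the 300 ms window and the threshold ratio are placeholders that would need tuning on labelled examples):

```python
# Heuristic cut-off detector: a recording stopped mid-word usually ends
# at high energy, while a natural stop decays into silence first.
import wave
import numpy as np

def ends_abruptly(path, tail_ms=300, ratio=0.5):
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        data = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    samples = data.astype(np.float64)
    tail = samples[-int(rate * tail_ms / 1000):]   # final 300 ms
    tail_rms = np.sqrt(np.mean(tail ** 2))
    whole_rms = np.sqrt(np.mean(samples ** 2))
    # If the tail is still at a large fraction of the overall level,
    # the voice never died down before the file ended.
    return tail_rms > ratio * whole_rms
```

The same tail features (RMS slope, zero-crossing rate, etc.) could also feed a small classifier trained on labelled clipped/unclipped examples, which is presumably the ML variant of this.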

Related

How to find an offset between two audio files? One is noisy and one is clear

I have a scenario in which a user captures a concert scene with the real-time audio of the performer, while at the same time the device is downloading the live stream from an audio broadcaster device. Later I replace the real-time noisy audio (captured while recording) with the good-quality audio I streamed and saved on the phone. Right now I set the audio offset manually on a trial-and-error basis while merging, so I can sync the audio and video activity at the exact position.
Now what I want to do is automate this synchronisation. Instead of merging the video with the clear audio at a given offset, I want to merge the video with the clear audio automatically, with proper sync.
For that I need to find the offset at which I should replace the noisy audio with the clear audio. E.g. when the user starts and stops the recording, I will take that sample of real-time audio, compare it with the live-streamed audio, take the exact matching part of that audio, and sync it at the perfect time.
Does anyone have any idea how to find the offset by comparing two audio files and syncing with the video?
Here's a concise, clear answer.
• It's not easy - it will involve signal processing and math.
• A quick Google gives me this solution, code included.
• There is more info on the above technique here.
• I'd suggest gaining at least a basic understanding before you try and port this to iOS.
• I would suggest you use the Accelerate framework on iOS for fast Fourier transforms etc
• I don't agree with the other answer about doing it on a server - devices are plenty powerful these days. A user wouldn't mind a few seconds of processing for something seemingly magic to happen.
Edit
As an aside, I think it's worth taking a step back for a second. While math and fancy signal processing like this can give great results, and do some pretty magical stuff, there can be outlying cases where the algorithm falls apart (hopefully not often).
What if, instead of getting complicated with signal processing, there's another way? After some thought, there might be. If you meet all the following conditions:
• You are in control of the server component (audio broadcaster device)
• The broadcaster is aware of the 'real audio' recording latency
• The broadcaster and receiver are communicating in a way that allows accurate time synchronisation
...then the task of calculating audio offset becomes reasonably trivial. You could use NTP or some other more accurate time synchronisation method so that there is a global point of reference for time. Then, it is as simple as calculating the difference between audio stream time codes, where the time codes are based on the global reference time.
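To make that last step concrete, a toy sketch (the function name and numbers below are invented for illustration):

```python
# If both devices stamp their streams with the same reference clock
# (e.g. via NTP), the offset is plain subtraction of time codes.
def offset_seconds(clean_start_ts, noisy_start_ts, broadcaster_latency=0.0):
    # broadcaster_latency: the known delay between the live sound and
    # the broadcast stream, which the broadcaster would have to report.
    return noisy_start_ts - clean_start_ts - broadcaster_latency

# e.g. noisy capture started at t=1002.75, broadcast stream at t=1002.50,
# with a reported 0.12 s encoding latency:
print(offset_seconds(1002.50, 1002.75, 0.12))  # -> 0.13 s shift to apply
```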
This could prove to be a difficult problem, as even though the signals are of the same event, the presence of noise makes a comparison harder. You could consider running some post-processing to reduce the noise, but noise reduction is in itself an extensive, non-trivial topic.
Another problem could be that the signals captured by the two devices actually differ a lot: for example, the good-quality audio (I guess the output from the live mix console?) will be fairly different from the live version (which I guess comes out of the on-stage monitors/FOH system and is captured by a phone mic?).
Perhaps the simplest approach to start with would be to use cross-correlation for the time-delay analysis.
A peak in the cross-correlation function indicates the relative time delay (in samples) between the two signals, so you can apply the shift accordingly.
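A minimal sketch of this in Python (assuming both signals are already decoded at the same sample rate; SciPy's correlate does the heavy lifting):

```python
# Cross-correlation delay estimate: the index of the correlation peak
# gives the lag (in samples) of one signal relative to the other.
import numpy as np
from scipy.signal import correlate

def estimate_delay(clean, noisy, rate):
    corr = correlate(noisy, clean, mode="full")
    lag = np.argmax(corr) - (len(clean) - 1)   # peak position -> lag
    return lag / rate                          # delay in seconds

# A positive result means the noisy capture starts later than the
# clean stream; shift the clean audio by that amount before merging.
```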
I don't know a lot about the subject, but I think you are looking for "audio fingerprinting". Similar question here.
An alternative (and more error-prone) way is to run both sounds through a speech-to-text library (or an API) and match the relevant parts. This would of course not be very reliable: sentences frequently repeat in songs, and the concert may be instrumental.
Also, doing audio processing on a mobile device may not go well (because of low performance, high battery drain, or both). I suggest you use a server if you go that way.
Good luck.

How does video on demand over P2P work?

In simple terms, how do video on demand and streaming video work over P2P? I assume videos are cut up into small pieces (a few seconds each) and these pieces are transferred in chunks. As soon as a user has finished watching a chunk, it is deleted from their computer. Wouldn't this mean that if no user on the network were currently watching a certain instance (chunk/time slice?) of the video, it would be permanently lost? If not, how does VoD over P2P work? If you store all the chunks, then it's exactly the same as normal file sharing with P2P.
Let me know if any parts of the question are unclear and I'll try to improve it.
P2P Live: each user downloads and simultaneously uploads chunks for other users who watch the same stream. More users means better quality.
source: P2P TV - Wikipedia
P2P VOD: this is more challenging to achieve since, as you noticed, there is less simultaneity in the way users watch the video. In this case each user is expected to contribute a reasonable amount of disk space to store chunks for other users. The strategies for deciding what to store in each user's cache are the subject of ongoing research.
If you search for P2P VOD you will find a lot of white papers presenting different approaches. There are too many links to list here.
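As a toy illustration of the caching idea (plain LRU eviction here, purely as an assumption; real systems weigh chunk rarity and demand, as those papers discuss):

```python
# Each peer contributes a fixed-size cache of chunks and serves them
# to other peers watching different parts of the same video.
from collections import OrderedDict

class PeerCache:
    def __init__(self, capacity_chunks):
        self.cache = OrderedDict()
        self.capacity = capacity_chunks

    def store(self, chunk_id, data):
        self.cache[chunk_id] = data
        self.cache.move_to_end(chunk_id)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used

    def serve(self, chunk_id):
        if chunk_id in self.cache:
            self.cache.move_to_end(chunk_id)
            return self.cache[chunk_id]
        return None   # requester must ask another peer or the source
```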

Voice Recording in Talking Character

I have managed to record the user's voice following the approach in the link below. The recording plays back fine except for the first words the user says. For example, if I say "Hello" it only records "llo", because I compare the power level against a threshold to decide when to start recording. Could anyone guide me on how to recover those initial letters to make it smooth?
Making of Talking App
Thanks!
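One common fix, sketched below purely as an illustration (the callback and buffer sizes are hypothetical, not taken from the linked tutorial), is to keep a short rolling pre-roll buffer at all times and prepend it once the threshold triggers:

```python
# Keep the most recent audio in a small ring buffer so the attack of
# the first word is not lost when the power threshold finally fires.
from collections import deque

PREROLL_CHUNKS = 8          # e.g. 8 chunks of ~32 ms each = ~250 ms

preroll = deque(maxlen=PREROLL_CHUNKS)
recording = []
recording_active = False

def on_audio_chunk(chunk, power, threshold):
    global recording_active
    if recording_active:
        recording.append(chunk)
    elif power >= threshold:
        recording.extend(preroll)   # recover the clipped beginning
        recording.append(chunk)
        recording_active = True
    else:
        preroll.append(chunk)       # keep only the last few chunks
```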

What is the best method of synchronizing audio across iOS devices with WiFi?

Basically, for my team's app, we need to be able to synchronize music across multiple iOS devices. The first way we did this was by having the music on all the devices already and just sending a play command to all the devices. Some would get it later than others, so that method did not work. There was an idea mentioned to calculate the latency between all the devices and send the commands at the appropriate times based on the latency.
The second way proposed would be to stream the music. If we were to implement streaming, how should we go about doing it? Should Audio Units be used, OpenAL, etc.? Also, if streaming were being done, how would we go about making sure that each device's stream was in sync?
Basically, the audio has to be in sync so that the person hearing it cannot differentiate between the devices. A few milliseconds off should not be a problem (unless the listener has super-human hearing).
You'd be amazed at how good the human ear is at spotting audio anomalies...
Sync the time of day
Effectively you're trying to meet a real-time requirement with a whole load of very variable things in the way (WiFi, etc.). I strongly suspect the only way you're going to get close to doing this is to issue a 'play' instruction that includes a particular time to start playing. Of course that relies on all the clocks being accurately set.
NTP
I don't know how iPhones get their time of day. If they use (or could use) NTP then you'll be getting close. NTP is designed to convey accurate time-of-day information over a network despite variable network delays. I've had a quick look and it seems that most NTP clients for iOS are the simple ones, not the full NTP that measures and tunes out network delays, clock drift, etc.
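For illustration, this is what a basic query reports using the third-party Python ntplib package (not an iOS API; just to show what NTP gives you):

```python
# A single SNTP query already reports the local clock's offset from
# the server, after compensating for the measured round-trip delay.
import ntplib

client = ntplib.NTPClient()
resp = client.request("pool.ntp.org", version=3)
print(f"local clock is off by {resp.offset:+.4f} s "
      f"(round trip {resp.delay:.4f} s)")
```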
GPS
Alternatively GPS is also a very good source of time information. Again I don't know if iPhones can or do use GPS for setting their clock but if it could be done then that would likely be pretty good. On Solaris (and I think Linux too) the 1 pulse per second that most GPS chips generate from the GPS signal can be used to condition the internal OS clock, making it very accurate indeed (sub microsecond accuracy).
I fear that iPhones don't do either of these things natively; both involve using a fair bit of electricity, so I wouldn't be surprised if they did something else less sophisticated.
Cell Time Service
Some Cell networks provide a time service too, but I don't think it's designed for accurate time setting. Also it tends not to be available everywhere. You often find it at major airports so that recent arrivals get their phones set to something close to local time.
Play at time X
So if one of those could be used to ensure that all the iPhones are set to exactly the same time of day then all you have to do is write your software to start playing at a specific time. That will probably involve polling the clock in a very tight loop waiting for it to tick over; most OSes don't provide a means of sleeping until a specific time. They do at least allow for sleeping for a period of time, which can be used to sleep until close to the appointed time. You'd then start polling the clock until the right time is reached.
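A rough sketch of that wait loop (Python here just for illustration; the same pattern applies on iOS):

```python
# Sleep until just before the agreed start time, then busy-wait on the
# clock for the final stretch. Assumes the devices' clocks have already
# been synchronised as described above.
import time

def wait_until(start_time, spin_window=0.01):
    remaining = start_time - time.time()
    if remaining > spin_window:
        time.sleep(remaining - spin_window)   # coarse sleep
    while time.time() < start_time:           # fine-grained busy-wait
        pass

wait_until(time.time() + 2.0)   # e.g. all devices agree on "now + 2 s"
# ...start playback here...
```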
Delay Measurement and Standard Deviation
Your first method is doomed, I think. You might be able to measure average delays and so forth, but that doesn't mean every message has exactly the same latency. The standard deviation in the latency will tell you what you can expect to achieve, and I don't think that's going to be particularly small. If so, then the message has to include a timestamp.
NTP can work because it's only interested in the average delay measured over a period of time (hours sometimes), whereas you're interested in instantaneous delay.
Streaming with RTP
Your second method may work if you can time sync the devices as discussed above. The RTP protocol was designed for use in these circumstances; it doesn't help with achieving sync, but it does help a lot with the streaming. It tells you whereabouts in the stream any one piece of received data fits, allowing you to play it at the right time.
Clock Drift
Another problem to deal with is how long you're playing for. If it's a long time then you may discover that the 44kHz (or whatever) audio clock rate on each device isn't quite the same. So, whilst you might find a way of starting to play all at the same time, the separate devices will then start diverging ever so slightly. Over a long period of time they may be noticeably out.
BlueTooth
It might be possible to do something with BlueTooth. It has many weird and wonderful profiles, and it might be that one of those would serve to send an accurate 'start now' message.
Audio Trigger
You might also use sound as a means of conveying a start signal. One device can play a particular sound whilst your software in the others is listening with the mic. When a particular feature is detected in the sound, that's the time for everyone to start playing. Sort of a computerised "1, 2, a 1 2 3 4".
Camera Flash
Should be easy to spot in software...
I think your first way would work if you expand it a little bit. Assuming all the clocks on the devices are in sync, you could include a timestamp in your play command. Then each device would calculate the time between the timestamp and when it received the command, and you would play the music offset by that time difference.
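A sketch of that idea (the playback hook is hypothetical; this also assumes the clocks are already in sync):

```python
# The sender includes the moment it issued the play command; each
# receiver measures how stale the command is and seeks that far into
# the track, so all devices end up aligned.
import time

def handle_play_command(sent_at, start_playback):
    # start_playback(seek_seconds) is a hypothetical playback hook.
    latency = time.time() - sent_at     # assumes synchronised clocks
    start_playback(seek_seconds=latency)
```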

Active noise cancellation

I have programmed a voice recognition program and I am having problems with the mic hearing me over the computer playing music. I need software that can filter out the sound leaving the speakers from the sound entering the mic.
Is there software or a component (for Delphi) that would solve my problem?
You need to capture:
• the computer output
• the mic input
Then you need to find two parameters, which depend on your mic location and the sound-system delay: a delay n and a gain k, such that
Stream1[t+n]*k = Stream2[t]
where t is time. Once you have found these parameters, the resulting stream containing only the speech picked up by the mic is
Stream2[t] - Stream1[t+n]*k = MusicReductionStream[t]
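A NumPy sketch of that formula (estimating n and k robustly, e.g. via cross-correlation and a least-squares fit, is the hard part and is not shown):

```python
# Align the captured music with the mic signal, scale it by k, and
# subtract it, following Stream2[t] - Stream1[t+n]*k above.
import numpy as np

def subtract_music(mic, music, n, k):
    # Stream1[t+n] lines up with Stream2[t], per the formula above.
    aligned = music[n : n + len(mic)]
    aligned = np.pad(aligned, (0, len(mic) - len(aligned)))
    return mic - k * aligned   # MusicReductionStream[t]
```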
I think you want to do what noise canceling microphones do. These systems use at least one extra microphone to calculate the difference between "surrounding noise" and the noise that is aimed directly at the microphone (the speech it has to register). I don't think you can reliably obtain the same effect with a software-only solution.
A first step would obviously be to turn the music down :-)
Check out the AsioVST library.
• 100% open-source Delphi code
• Free
• Very complete
• Active (support for XE2/x64 is being added, for example)
Under Examples\Plugins\Crosstalk Cancellation\ you'll find the source code for a plugin that probably does what you're looking for.
The magic happens in DAV_DspCrosstalkCancellation.pas.
I think the Speex preprocessor has an echo-cancellation feature. You'll need to feed it the audio data you recorded and the audio you want to cancel, and it'll try to remove it.
The main problem is finding out what audio your computer is playing. I'm not sure there is a good API for that.
It also has a noise-reduction feature and voice activity detection. You can compile it as a DLL and then write a Delphi header for it.
You need to estimate the impulse response of the speaker, room, etc., which can change with the exact speaker and mic positioning and with the size and contents of the room, as well as knowing or estimating the system delay.
If the person or the mic can move, the impulse response and delay will need to be continually re-estimated.
Once you have estimated the impulse response, you can convolve it with the output signal and try to subtract delayed versions of the result from the mic input until you can null silent portions of the speech input. Cross-correlation might be useful for estimating the delay.
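A sketch of the convolve-and-subtract step with SciPy, assuming an impulse response h has already been estimated and the delay found via cross-correlation as suggested above (real systems use adaptive filters such as NLMS that re-estimate h continually):

```python
# Convolve the played signal with the estimated speaker/room impulse
# response, apply the measured delay, and subtract from the mic input.
import numpy as np
from scipy.signal import fftconvolve

def cancel_playback(mic, played, h, delay):
    echo = fftconvolve(played, h)               # speaker+room response
    shifted = np.zeros_like(mic)
    n = min(len(mic) - delay, len(echo))
    shifted[delay : delay + n] = echo[:n]       # apply the delay
    return mic - shifted
```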
