Question 1: Is it feasible? (As far as I can tell from searching Google, it is; however, I need a more definitive answer.)
Question 2: Say I have a device that generates a square wave; how can I extract the message from it?
As a beginner, I'd like to know which classes I should pay attention to.
Thanks, any info will be appreciated.
This is a great question. I've put together a C library which does this, so you might be able to adapt it for iOS.
Library: https://github.com/quiet/quiet
Live Demo: https://quiet.github.io/quiet-js/lab.html
With a sound port, you want to avoid square waves. Those make inefficient use of the range of amplitudes you have available, and they're not very spectrally efficient. The most basic modulation people typically use here is frequency shift keying. My library offers that (as Gaussian minimum shift keying) but also more advanced modes like phase shift keying and quadrature amplitude modulation. I've managed to reach transfer speeds of 64 kbps using this library.
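To make the frequency shift keying idea concrete, here is a minimal, hypothetical sketch of a binary FSK modulator in C++; the frequencies, bit duration, and function name are made-up example values, not the Quiet library's API:

#include <cmath>
#include <cstdint>
#include <vector>

// Minimal binary FSK modulator: each bit becomes a short tone burst at one of
// two frequencies. All parameter values here are arbitrary examples.
std::vector<float> fskModulate(const std::vector<uint8_t>& bits,
                               double sampleRate = 44100.0,
                               double freqZero = 1200.0,   // tone used for a 0 bit
                               double freqOne = 2200.0,    // tone used for a 1 bit
                               double bitDuration = 0.01) {
    const double kPi = 3.14159265358979323846;
    const int samplesPerBit = static_cast<int>(sampleRate * bitDuration);
    std::vector<float> samples;
    double phase = 0.0;
    for (uint8_t bit : bits) {
        const double freq = bit ? freqOne : freqZero;
        for (int i = 0; i < samplesPerBit; ++i) {
            samples.push_back(static_cast<float>(std::sin(phase)));
            phase += 2.0 * kPi * freq / sampleRate;  // keep phase continuous across bits
        }
    }
    return samples;
}

On the receive side you would do the mirror image: measure the energy near freqZero vs. freqOne in each bit-length window (the Goertzel algorithm is the usual cheap way) and pick whichever is stronger.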
OK, let me try to rephrase this:
I'm looking for a method that takes an audio file as input and outputs a list of transients (distinctive peaks), based on a given sensitivity.
The audio is a recording of a spoken phrase of, for example, 5 words. The method would return a list of numbers (e.g. sample offsets or milliseconds) where the words start. My ultimate goal is to play each word individually.
As suggested in a comment (I really struck some negative chord here) I am NOT asking anyone to write any code for me.
I've been around on this forum a while now, and the community has always been very helpful. The most helpful answers were those that pointed out my rigid way of thinking, offering surprising alternatives or workarounds based upon their own experiences.
I guess this topic is just too much of a niche.
Before Edit:
For my iOS app, I need to programmatically cut up a spoken phrase into words for further processing. I know what words to expect, so I can make some assumptions about where words would start.
However, in any case, a transient detection algorithm/method would be very helpful.
Google points me to either commercial products or highly academic papers that are beyond my brain power.
Luckily, you are much smarter and more knowledgeable than I am, so you can help and simplify my problems.
Don't let me down!
There are a couple of simple basic ideas you can put to work here.
First, take the input audio and divide it into small buckets (on the order of tens of milliseconds). For each bucket, compute the power of the samples in it by summing the squares of the sample values.
For example, say you have 16-bit samples at 44.1 kHz in an array called s. One second's worth of data is 44100 samples, so a 10 ms bucket size gives you 441 samples per bucket. To compute the power of one bucket you could do this:
// Power of one 10 ms bucket: sum of squared samples, normalized to [-1, 1).
float power = 0.0f;
for (int i = 0; i < 441; i++) {
    // 16-bit samples range from -32768 to 32767.
    float normalized = (float)s[i] / 32768.0f;
    power += normalized * normalized;
}
Once you build an array of power values, you can look at relative changes in power from bucket to bucket to do basic signal detection.
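As a rough sketch of that second step (the names, the 3x ratio, and the smoothing factor below are assumptions you would tune for your recordings, not part of the description above):

#include <vector>

// Very naive onset detection: a bucket whose power jumps well above the
// running background level is treated as the start of a word.
std::vector<int> findOnsets(const std::vector<float>& bucketPower,
                            float ratio = 3.0f) {   // "sensitivity": assumed value
    std::vector<int> onsets;
    float background = 1e-6f;                       // running quiet-level estimate
    if (!bucketPower.empty() && bucketPower[0] > background) {
        background = bucketPower[0];                // seed from the first bucket
    }
    bool inWord = false;
    for (int i = 1; i < static_cast<int>(bucketPower.size()); ++i) {
        if (!inWord && bucketPower[i] > ratio * background) {
            onsets.push_back(i);                    // bucket index; * bucket length = time
            inWord = true;
        } else if (inWord && bucketPower[i] < ratio * background) {
            inWord = false;                         // power fell back toward background
        }
        if (!inWord) {
            // Track the background only outside words (simple exponential average).
            background = 0.95f * background + 0.05f * bucketPower[i];
        }
    }
    return onsets;
}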
Good luck!
Audio analysis is a very complex topic. You could fairly easily detect individual words and slice them apart, but actually identifying them requires a lot of processing and advanced algorithms.
Sadly, there is not much we can tell you beyond the fact that there is no way around it. You said you found commercial products, and I would suggest going for those. Papers are not always complete enough, or not right for the language/platform/use case you want, and they often lack the details needed for a proper implementation by someone without prior knowledge of the topic.
You may be lucky and find an open source implementation that suits your needs. Here's what a little bit of research returned:
How to use Speech Recognition inside the iOS SDK?
free speech recognition engines for iOS?
You'll quickly see speech recognition is not something you should start from scratch. Choose a library, try it for a little bit and see if it works!
I'm a Software Engineering student in the last year of a 4-year bachelor's degree program, and I'm required to work on a graduation project of my own choice.
We are trying to find a way to notify the user of anything that gets in his/her way while walking. This will be implemented as an Android application, so we have the ability to use the camera. We thought of image processing and computer vision, but neither I nor any of my group members have an image processing background. We searched a little bit and found out about OpenCV.
So my question is: do I need any special background to deal with OpenCV? And is computer vision a good choice for the objective of my project? If not, what alternatives would you advise me to use?
I appreciate your help. Thanks in advance!
At first glance, I would use 2 standard cameras to compute a depth image via stereo vision (similar to the MS Kinect depth sensor);
from that, it would be easy to set a threshold at some distance.
Those algorithms are very CPU-hungry, so I do not think it will work well on Android (although I have zero experience with it).
If you must use Android, I would look for a depth sensor (to avoid extracting depth data from 2 images).
For prototyping I would use MATLAB (or Octave), then switch to OpenCV (pointers, memory allocations, etc.).
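If you do go the OpenCV route, a minimal sketch of the two-camera idea might look like the following; it assumes already rectified left/right grayscale frames, and the block-matcher settings and thresholds are placeholder values, not recommendations:

#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>

// Returns true if a "large enough" region of the scene appears closer than the
// disparity threshold allows, i.e. an obstacle is probably in the way.
bool obstacleAhead(const cv::Mat& leftGray, const cv::Mat& rightGray,
                   int disparityThreshold = 48,      // assumed: depends on baseline/focal length
                   double minObstacleFraction = 0.05) {
    // Block matcher with placeholder parameters (numDisparities, blockSize).
    cv::Ptr<cv::StereoBM> bm = cv::StereoBM::create(64, 15);
    cv::Mat disparity16;
    bm->compute(leftGray, rightGray, disparity16);   // disparities scaled by 16, CV_16S

    // Count pixels whose disparity (i.e. closeness) exceeds the threshold.
    cv::Mat close = disparity16 > (disparityThreshold * 16);
    double fraction = static_cast<double>(cv::countNonZero(close)) / close.total();
    return fraction > minObstacleFraction;
}

In practice you would calibrate the cameras first (cv::stereoCalibrate / cv::stereoRectify) and map disparity to metric distance using the baseline and focal length.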
I want to make an iOS app that lets me graph the intonation (the rise and fall of the pitch of the voice) of an audio sample as read aloud by the user. Intonation is very important in various languages around the world, and this would be an attempt to practice intonation as well as pronunciation.
I am not very well versed in the world of speech/audio technology, so what do I need? Are there libraries that ship with Cocoa Touch that give me the ability to access the data I need from a voice sample? What exactly am I going to be looking to capture?
If anyone has an idea of the technology I am going to need to leverage, I would appreciate a point in the right direction.
Thanks!
What you're looking for is called formant analysis.
Formants are, in essence, the spectral peaks of the uttered sounds. They are listed in order of frequency, as in f1, f2, etc. It seems to me that what you're looking to plot is f1.
Formant analysis is at the core of speech recognition; usually f1 and f2 are enough to tell vowels apart. I'd recommend you do a search on formant analysis algorithms and take it from there.
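As a first experiment, a crude way to see spectral peaks is simple peak picking on an FFT magnitude spectrum; real formant estimation is usually done with LPC or similar, so treat the sketch below (with its assumed threshold) as a toy illustration only:

#include <vector>

// Naive spectral peak picking: returns FFT bin indices that are local maxima
// and stand out above a fraction of the frame's strongest bin.
std::vector<int> findSpectralPeaks(const std::vector<float>& magnitude,
                                   float relativeThreshold = 0.1f) {  // assumed value
    std::vector<int> peaks;
    float maxMag = 0.0f;
    for (float m : magnitude) {
        if (m > maxMag) maxMag = m;
    }
    for (int i = 1; i + 1 < static_cast<int>(magnitude.size()); ++i) {
        bool localMax = magnitude[i] > magnitude[i - 1] && magnitude[i] > magnitude[i + 1];
        if (localMax && magnitude[i] > relativeThreshold * maxMag) {
            peaks.push_back(i);  // bin index; frequency = i * sampleRate / fftSize
        }
    }
    return peaks;
}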
Good luck :)
I need to write a speech detection algorithm (not speech recognition).
At first I thought I would just have to measure the microphone power and compare it to some threshold value. But the problem gets much harder once you have to take the ambient sound level into consideration (for example, in a pub a simple power threshold is crossed immediately because of other people talking).
So in the second version I thought I would have to measure the current power spikes against the average sound level, or something like that. Coding this idea proved to be quite hairy for me, at which point I decided it might be time to research existing solutions.
Do you know of some general algorithm description for speech detection? Existing code or library in C/C++/Objective-C is also fine, be it commercial or free.
P.S. I guess there is a difference between “speech” and “sound” recognition, with the first one only responding to frequencies close to human speech range. I’m fine with the second, simpler case.
The key phrase that you need to Google for is Voice Activity Detection (VAD) – it's implemented widely in telecomms, particularly in Acoustic Echo Cancellation (AEC).
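For a feel of the simplest energy-based VAD (essentially your second idea of comparing each frame against a running noise estimate), here is a hypothetical sketch; the margin and smoothing factor are assumed values, and production VADs add frequency-domain features and hangover logic on top:

#include <vector>

// Classifies one frame as speech/non-speech by comparing its average energy
// to a slowly adapting noise-floor estimate held by the caller.
bool isSpeechFrame(const std::vector<float>& frame,   // samples scaled to [-1, 1]
                   float& noiseFloor,                 // init from a stretch of silence
                   float margin = 3.0f) {             // assumed: speech must be 3x noise
    if (frame.empty()) return false;
    float energy = 0.0f;
    for (float s : frame) {
        energy += s * s;
    }
    energy /= static_cast<float>(frame.size());

    const bool speech = energy > margin * noiseFloor;
    if (!speech) {
        // Adapt the noise floor only during non-speech so talking doesn't inflate it.
        noiseFloor = 0.95f * noiseFloor + 0.05f * energy;
    }
    return speech;
}

You would initialise noiseFloor from a short stretch of presumed silence and call this once per 10-20 ms frame; smoothing the boolean output over a few frames avoids chattering.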
I plan to write a conversation analysis software, which will recognize the individual speakers, their pitch and intensity. Pitch and intensity are somewhat straightforward (pitch via autocorrelation).
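By autocorrelation I mean something like the bare-bones sketch below (the search range and sample rate are just placeholder values):

#include <vector>

// Bare-bones autocorrelation pitch estimate for one frame of mono samples.
// Returns the estimated fundamental frequency in Hz, or 0 if nothing convincing.
float estimatePitch(const std::vector<float>& frame, float sampleRate = 44100.0f) {
    const int minLag = static_cast<int>(sampleRate / 500.0f);  // ~500 Hz upper bound
    const int maxLag = static_cast<int>(sampleRate / 70.0f);   // ~70 Hz lower bound
    if (static_cast<int>(frame.size()) <= maxLag) return 0.0f;

    int bestLag = 0;
    float bestCorr = 0.0f;
    for (int lag = minLag; lag <= maxLag; ++lag) {
        float corr = 0.0f;
        for (int i = 0; i + lag < static_cast<int>(frame.size()); ++i) {
            corr += frame[i] * frame[i + lag];
        }
        if (corr > bestCorr) {
            bestCorr = corr;
            bestLag = lag;
        }
    }
    return bestLag > 0 ? sampleRate / bestLag : 0.0f;
}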
How would I go about recognizing individual speakers, so I can record his/her features? Will storing some heuristics for each speaker's frequencies be enough? I can assume that only one person speaks at a time (strictly non-overlapping). I can also assume that for training, each speaker can record a minute's worth of data before actual analysis.
Pitch and intensity on their own tell you nothing. You really need to analyse how pitch varies. In order to identify different speakers you need to transform the speech audio into some kind of feature space, and then make comparisons against your database of speakers in this feature space. The general term that you might want to Google for is prosody - see e.g. http://en.wikipedia.org/wiki/Prosody_(linguistics). While you're Googling you might also want to read up on speaker identification aka speaker recognition, see e.g. http://en.wikipedia.org/wiki/Speaker_identification
If you are still working on this... are you using speech-recognition on the sound input? Because Microsoft SAPI for example provides the application with a rich API for digging into the speech sound wave, which could make the speaker-recognition problem more tractable. I think you can get phoneme positions within the waveform. That would let you do power-spectrum analysis of vowels, for example, which could be used to generate features to distinguish speakers. (Before anybody starts muttering about pitch and volume, keep in mind that the formant curves come from vocal-tract shape and are fairly independent of pitch, which is vocal-cord frequency, and the relative position and relative amplitude of formants are (relatively!) independent of overall volume.) Phoneme duration in-context might also be a useful feature. Energy distribution during 'n' sounds could provide a 'nasality' feature. And so on. Just a thought. I expect to be working in this area myself.