Does speech recognition work for WAV files created with iOS devices?

I have been trying to get the Microsoft SpeechSDK speech recognition backend to work with WAV files created using AVAudioRecorder, and noticed that the DataRecognitionClient doesn't return any errors or partial/final responses.
If, however, I export that same WAV file from Audacity as WAV (Microsoft) signed 16-bit PCM, it works fine.
Repro:
1. On an Apple device, use AVAudioRecorder to create an audio.wav file (with less than 2 minutes of speech) with the following format settings (a recorder setup sketch follows these steps):
let recordSettings: [String: AnyObject] = [
    AVFormatIDKey: NSNumber(int: Int32(kAudioFormatLinearPCM)),
    AVNumberOfChannelsKey: NSNumber(int: 1),
    AVSampleRateKey: NSNumber(float: 16000.0),
    AVLinearPCMBitDepthKey: NSNumber(int: 16),
    AVLinearPCMIsFloatKey: false,
    AVLinearPCMIsBigEndianKey: false]
2. Download and open the https://github.com/microsoft/cognitive-speech-stt-ios example project.
3. Open the SpeechRecognitionServerExample project and add the audio.wav file recorded in step 1 to the SpeechRecognitionServerExample/Assets group.
4. Open ViewController.mm, go to the longWaveFile function, and replace the file name with @"audio.wav".
5. Run the example and notice that no error is returned and nothing is recognized either.
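For reference, here is a minimal sketch of step 1 in current Swift (the audio-session setup, file path, and names like makeRecorder are my assumptions for illustration; recordSettings is the dictionary shown above):
import AVFoundation

// Sketch only: records audio.wav into the documents directory using the
// recordSettings dictionary from step 1.
func makeRecorder() throws -> AVAudioRecorder {
    let audioURL = FileManager.default
        .urls(for: .documentDirectory, in: .userDomainMask)[0]
        .appendingPathComponent("audio.wav")

    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playAndRecord, mode: .default)
    try session.setActive(true)

    return try AVAudioRecorder(url: audioURL, settings: recordSettings)
}

// let recorder = try makeRecorder()
// recorder.record()       // speak for under two minutes
// recorder.stop()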
Analysis:
The only apparent difference between the WAV files created by AVAudioRecorder and the sample WAV files provided in the SpeechSDK project (batman.wav and whatstheweatherlike.wav) is that the AVAudioRecorder files contain a "FLLR" sub-chunk, used for page alignment, between the "fmt" and "data" sub-chunks of the RIFF header.
(Image: RIFF/WAVE header layout, Apple vs. Microsoft)
While unusual, this is still spec-compliant (RIFF readers are supposed to skip unknown sub-chunks), but it seems the SDK does not account for it, which prevents speech recognition from running. Are there any suggested workarounds for this?
Update:
So I went ahead and created a new audio recording class that uses Audio Queues and does exactly what AVAudioRecorder does, except that it omits the "FLLR" sub-chunk. This can be done when creating the audio file by setting the AudioFileFlags.DontPageAlignAudioData flag:
AudioFileCreateWithURL(
    filePathUrl,
    kAudioFileWAVEType,
    &dataFormat,
    [.DontPageAlignAudioData, .EraseFile],
    &audioFile)
Doing this makes speech recognition start working. Does anyone know of a way to tell AVAudioRecorder not to page-align the audio data? I read through the Apple documentation and couldn't find any setting or option. I really don't want to maintain code that duplicates existing functionality just because of this.
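For completeness, an alternative to rewriting the recorder would be to post-process the file AVAudioRecorder produces and drop the "FLLR" sub-chunk before handing it to the SDK. A rough sketch (assuming a well-formed RIFF/WAVE file, with little-endian chunk sizes padded to even lengths; not production code):
import Foundation

// Copies every RIFF sub-chunk except "FLLR" into a new file and patches the RIFF size.
func stripFillerChunk(from inputURL: URL, to outputURL: URL) throws {
    let data = try Data(contentsOf: inputURL)
    guard data.count > 12 else { return }                      // too small to be RIFF/WAVE

    // Little-endian UInt32 at the given byte offset.
    func readUInt32(at offset: Int) -> UInt32 {
        return (0..<4).reduce(UInt32(0)) { $0 | (UInt32(data[offset + $1]) << (8 * $1)) }
    }

    var output = data.subdata(in: 0..<12)                      // "RIFF" + size + "WAVE"
    var offset = 12
    while offset + 8 <= data.count {
        let chunkID = String(bytes: data.subdata(in: offset..<offset + 4), encoding: .ascii) ?? ""
        let chunkSize = Int(readUInt32(at: offset + 4))
        let chunkEnd = min(offset + 8 + chunkSize + (chunkSize % 2), data.count)
        if chunkID != "FLLR" {
            output.append(data.subdata(in: offset..<chunkEnd))
        }
        offset = chunkEnd
    }

    // RIFF size = total file length minus the 8-byte "RIFF" header.
    let riffSize = UInt32(output.count - 8).littleEndian
    output.replaceSubrange(4..<8, with: withUnsafeBytes(of: riffSize) { Data($0) })
    try output.write(to: outputURL)
}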

Related

Google cloud speech very inaccurate and misses words on clean audio

I am using Google Cloud Speech through Python and find that many transcriptions are inaccurate and miss several words. This is a simple script I'm using to return a transcript of an audio file, in this case 'out307.wav':
import io

from google.cloud import speech

client = speech.SpeechClient()
with io.open('out307.wav', 'rb') as audio_file:
    content = audio_file.read()
audio = speech.types.RecognitionAudio(content=content)
config = speech.types.RecognitionConfig(
    enable_word_time_offsets=True,
    language_code='en-US',
    audio_channel_count=1)
response = client.recognize(config, audio)
for result in response.results:
    alternative = result.alternatives[0]
    print(u'Transcript: {}'.format(alternative.transcript))
This returns the following transcript:
to do this the tensions and suspicions except
This is very far from what the actual audio says (I've uploaded it at https://vocaroo.com/i/s1zdZ0SOH1Ki). The audio is a .wav, very clear, with no background noise. This example is worse than average: in some cases the transcription of a 10-second audio file is fully correct, or it misses just a couple of words. Is there anything I can do to improve results?
This is weird: I tried your audio file with your code and got the same result, but if I change the language_code to "en-UK" I am able to get the full response.
I work for Google Cloud and I created a public issue for you here; you can track the updates there.

Piano notes with AKKeyboardView

I am new to AudioKit - I am able to use the AKKeyboardView to play notes using AKOscillatorBank, but I want the audio to sound more like a grand piano. Loading .wav files seems to make the notes choppy. I have also changed the note envelope. How can I map grand piano notes onto the AKKeyboardView keys?
You're not easily going to get a piano sound out of an oscillator. You might want to use a SoundFont instead. You can load an .sf2 (but not .sf3, I believe) into an AKAppleSampler and trigger it from AKKeyboardDelegate, just as you are doing with the AKOscillatorBank. MuseScore maintains a list of SoundFont links, many of which use open-source licenses.
First add the sf2 file to your project, then set up the AKAppleSampler:
let sampler = AKAppleSampler()
// note that if you're using a GM soundfont, 'Grand Piano' will be preset 0
try sampler.loadMelodicSoundFont("NameOfSoundFontWithoutExtension", preset: 0)
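To wire the sampler to the keyboard, something along these lines should work (a rough sketch assuming AudioKit 4.x; the Conductor class name and the fixed velocity are illustrative, not required by the API):
import AudioKit

class Conductor: AKKeyboardDelegate {
    let sampler = AKAppleSampler()

    init() throws {
        // 'Grand Piano' is preset 0 in a General MIDI soundfont
        try sampler.loadMelodicSoundFont("NameOfSoundFontWithoutExtension", preset: 0)
        AudioKit.output = sampler
        try AudioKit.start()
    }

    func noteOn(note: MIDINoteNumber) {
        try? sampler.play(noteNumber: note, velocity: 100, channel: 0)
    }

    func noteOff(note: MIDINoteNumber) {
        try? sampler.stop(noteNumber: note, channel: 0)
    }
}

// keyboardView.delegate = conductor   // AKKeyboardView key presses then drive the sampler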

Adding metadata to generated audio file

I'm generating an audio file programmatically, and I'd like to add metadata to it, such as the title and artist. I don't particularly care what format the file is written in, as long as AVPlayer will read it and send it to the playing device. (The whole goal is to send this generated audio and its track name to a Bluetooth device. I'm happy to explore easier ways to achieve this on iPhone that don't require writing the file or adding metadata directly to the file.)
So far I've discovered that AVAssetWriter will often just throw away metadata that it doesn't understand, without generating errors, so I'm stumbling a bit trying to find which combinations of file formats and keys are acceptable. I have not yet found a file format that I can generate for which AVAssetWriter will add any metadata. For example:
let writer = try AVAssetWriter(outputURL: output, fileType: .aiff)
let title = AVMutableMetadataItem()
title.identifier = .commonIdentifierTitle
title.dataType = kCMMetadataBaseDataType_UTF8 as String
title.value = "The Title" as NSString
writer.metadata = [title]
// setup the input and write the file.
I haven't found any combination of identifiers or fileTypes (that I can actually generate) that will include this metadata in the file.
My current approach is to create the file as an AIFF, and then use AVAssetExportSession to rewrite it as an m4a. Using that I've been able to add enough metadata that iTunes will show the title. However, Finder's "File Info" is not able to read the title (which it does for iTunes m4a files). My assumption is that if it doesn't even show up in File Info, it's not going to be sent over Bluetooth (I'll be testing that soon, but I don't have the piece of hardware I need handy).
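For reference, that AIFF-to-m4a step looks roughly like this (a sketch; aiffURL and m4aURL are placeholder names, and title is the AVMutableMetadataItem from above):
import AVFoundation

let asset = AVAsset(url: aiffURL)
if let export = AVAssetExportSession(asset: asset, presetName: AVAssetExportPresetAppleM4A) {
    export.outputURL = m4aURL
    export.outputFileType = .m4a
    export.metadata = [title]          // metadata that iTunes shows, but File Info does not
    export.exportAsynchronously {
        // status == .completed here yields the rewritten .m4a
    }
}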
Studying iTunes m4a files, I've found some tags that I cannot recreate with AVMetadataItem. For example, Sort Name (sonm). I don't know how to write tags that aren't one of the known identifiers (and I've tested all 263 AVMetadataIdentifiers).
With that background, my core questions:
What metadata tags are read by AVPlayer and sent to Bluetooth devices (i.e. AVRCP)?
Is it possible to write metadata directly with AVAssetWriter to a file format that supports Linear PCM (or some other easy-to-generate format)?
Given a known tag/value that does not match any of the AVMetadataIdentifiers, is it possible to write it with AVAssetExportSession?
I'll explore third-party id3 frameworks later, but I'd like to achieve it with AVFoundation (or other built-in framework) if possible.
I've been able to use AVAssetWriter to store metadata values in a .m4a file using the iTunes key space:
let songID = AVMutableMetadataItem()
songID.value = "songID" as NSString
songID.identifier = .iTunesMetadataSongID
let songName = AVMutableMetadataItem()
songName.value = "songName" as NSString
songName.identifier = .iTunesMetadataSongName
You can write compressed .m4a files directly using AVAssetWriter by specifying the correct settings when you set up the input object, so there’s no need to use an intermediate AIFF file.
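For example, something along these lines (a sketch; the sample rate and bit rate are assumed values) writes AAC directly into an .m4a and attaches the metadata items above:
import AVFoundation

let writer = try AVAssetWriter(outputURL: outputURL, fileType: .m4a)
writer.metadata = [songID, songName]

// Compressed AAC output settings for the audio input.
let audioSettings: [String: Any] = [
    AVFormatIDKey: kAudioFormatMPEG4AAC,
    AVSampleRateKey: 44_100,
    AVNumberOfChannelsKey: 2,
    AVEncoderBitRateKey: 128_000
]
let input = AVAssetWriterInput(mediaType: .audio, outputSettings: audioSettings)
writer.add(input)
// ... append sample buffers to `input`, then finish writing as usual.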

Playing multi-sampled Instruments using AudioKit, controlling ADSR envelope

I'm trying to play an instrument built from several .wav samples using AudioKit.
Here is what I've tried so far:
Using AKSampler (with the underlying AVAudioUnitSampler) – it works fine, but I can't figure out how to control the ADSR envelope here – calling stop stops the note immediately.
Another way is to use an AKSamplePlayer for each sample, manually setting the rate so it plays the right note. I could then (possibly?) connect an AKAmplitudeEnvelope to each sample player. But if I want to play 5 notes of the same sample simultaneously, I would need 5 instances of AKSamplePlayer, which seems like a waste of resources.
I also tried to find a way to just push raw audio samples to the AudioKit output buffer, doing the mixing and sample interpolation myself (in C, probably?), but I didn't find out how to do it :(
What is the right way to make a multi-sampled instrument using AudioKit? I feel like it must be a fairly simple task.
Thanks to mahal tertin, it's pretty easy to use AKAUPresetBuilder!
You can create an .aupreset file somewhere in the tmp directory and then load this instrument with AKSampler.
The only thing worth noting is that by default AKAUPresetBuilder will generate samples with the trigger mode set to trigger, which ignores note-off events, so you should set it explicitly.
For example:
let sampleC4 = AKAUPresetBuilder.generateDictionary(
    rootNote: 60,
    filename: pathToC4WavSample,
    startNote: 48,
    endNote: 65)
sampleC4["triggerMode"] = "hold"

let sampleC5 = AKAUPresetBuilder.generateDictionary(
    rootNote: 72,
    filename: pathToC5WavSample,
    startNote: 66,
    endNote: 83)
sampleC5["triggerMode"] = "hold"

AKAUPresetBuilder.createAUPreset(
    dict: [sampleC4, sampleC5],
    path: pathToAUPresetFilename,
    instrumentName: "My Instrument",
    attack: 0,
    release: 0.2)
and then create a sampler and start AudioKit:
sampler = AKSampler()
try sampler.loadInstrument(atPath: pathToAUPresetFilename)
AudioKit.output = sampler
AudioKit.start()
and then use this to start playing note:
sampler.play(noteNumber: MIDINoteNumber(63), velocity: MIDIVelocity(120), channel: 0)
and this to stop, respecting release parameter:
sampler.stop(noteNumber: MIDINoteNumber(63), channel: 0)
Probably the best way would be to embed your wav files into an EXS or SoundFont format and use tools in that realm to handle the ADSR, for instance. Otherwise you'll more or less have to have an instrument for each sample.

Why am I receiving only a few audio samples per second when using AVAssetReader on iOS?

I'm coding something that:
records video+audio with the built-in camera and mic (AVCaptureSession),
does some stuff with the video and audio sample buffers in real time,
saves the result into a local .mp4 file using AVAssetWriter,
then (later) reads the file (video+audio) using AVAssetReader,
does some other stuff with the sample buffers (for now nothing),
and writes the result into a final video file using AVAssetWriter.
Everything works well but I have an issue with the audio format:
When I capture the audio samples from the capture session, I can log about 44 samples/sec, which seems to be normal.
When I read the .mp4 file, I only log about 3-5 audio samples/sec!
But the 2 files look and sound exactly the same (in QuickTime).
I didn't set any audio settings for the Capture Session (as Apple doesn't allow it).
I configured the outputSettings of the two audio AVAssetWriterInputs as follows:
NSDictionary *settings = @{
    AVFormatIDKey: @(kAudioFormatLinearPCM),
    AVNumberOfChannelsKey: @(2),
    AVSampleRateKey: @(44100.),
    AVLinearPCMBitDepthKey: @(16),
    AVLinearPCMIsNonInterleaved: @(NO),
    AVLinearPCMIsFloatKey: @(NO),
    AVLinearPCMIsBigEndianKey: @(NO)
};
I pass nil to the outputSettings of the audio AVAssetReaderTrackOutput in order to receive samples as stored in the track (according to the doc).
So the sample rate should be 44100 Hz from the capture session through to the final file. Why am I reading only a few audio samples? And why does it work anyway? I have the intuition that it will not work well once I have to work with the samples (I need to update their timestamps, for example).
I tried several other settings (such as kAudioFormatMPEG4AAC), but AVAssetReader can't read compressed audio formats.
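One thing that may be worth checking (a Swift sketch; trackOutput stands for the AVAssetReaderTrackOutput described above, after startReading has been called) is how many audio frames each returned CMSampleBuffer actually carries, since a single buffer from AVAssetReader can hold many PCM frames, so counting buffers per second is not the same as counting samples per second:
import AVFoundation

while let buffer = trackOutput.copyNextSampleBuffer() {
    let frames = CMSampleBufferGetNumSamples(buffer)              // audio frames in this buffer
    let pts = CMSampleBufferGetPresentationTimeStamp(buffer)
    print("buffer at \(CMTimeGetSeconds(pts)) s carries \(frames) frames")
}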
Thanks for your help :)
