I am using Google Cloud Speech through Python and finding that many transcriptions are inaccurate and missing several words. This is the simple script I'm using to return a transcript of an audio file, in this case 'out307.wav':
import io
from google.cloud import speech

client = speech.SpeechClient()

# Read the local audio file into memory.
with io.open('out307.wav', 'rb') as audio_file:
    content = audio_file.read()

audio = speech.types.RecognitionAudio(content=content)
config = speech.types.RecognitionConfig(
    enable_word_time_offsets=True,
    language_code='en-US',
    audio_channel_count=1)

response = client.recognize(config, audio)
for result in response.results:
    alternative = result.alternatives[0]
    print(u'Transcript: {}'.format(alternative.transcript))
This returns the following transcript:
to do this the tensions and suspicions except
This is very far from what the audio actually says (I've uploaded it at https://vocaroo.com/i/s1zdZ0SOH1Ki). The audio is a .wav and very clear, with no background noise. This example is worse than average; in some cases it will get the transcription fully correct on a 10-second audio file, or it may miss just a couple of words. Is there anything I can do to improve results?
This is weird. I tried your audio file with your code and got the same result, but if I change the language_code to "en-UK" I am able to get the full response.
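For reference, that is a one-line change to the config from the question (everything else stays the same):

config = speech.types.RecognitionConfig(
    enable_word_time_offsets=True,
    language_code='en-UK',  # was 'en-US'
    audio_channel_count=1)
response = client.recognize(config, audio)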
I work for Google Cloud and I have created a public issue for you here; you can track the updates there.
Related
I recently figured out that Google's Vision API can accept an external image URL, and I was curious if anyone knew whether Google's Speech API could accept an external video URL, such as a YouTube video.
The code I have in mind would look something like this:
def transcribe_gcs(youtube_url):
    """Asynchronously transcribes the audio at the given URL."""
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types

    client = speech.SpeechClient()
    audio = types.RecognitionAudio(uri=youtube_url)  # swapped out gcs_uri with youtube_url
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        # sample_rate_hertz=16000,
        language_code='en-US')

    operation = client.long_running_recognize(config, audio)

    print('Waiting for operation to complete...')
    response = operation.result(timeout=90)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u'Transcript: {}'.format(result.alternatives[0].transcript))
        print('Confidence: {}'.format(result.alternatives[0].confidence))
I was curious if anyone knew if Google's Speech could accept an external video URL such as a YouTube video?
It needs to be either a local path to your audio file (for audio shorter than 1 minute) or a GCS URI for an audio file longer than 1 minute. What you're thinking of is not possible; the audio/video file needs to be in GCS.
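In other words, keeping the rest of the snippet above as it is, the source would have to become a gs:// URI pointing at audio you have uploaded yourself (the bucket and object names here are just placeholders):

audio = types.RecognitionAudio(uri='gs://your-bucket/audio-extracted-from-video.flac')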
I think you can achieve this by streaming the same video (for example on Wowza or on any server of your choice), then simply extracting the audio with, let's say, ffmpeg and passing that to Google. It should work. Use StreamingRecognizeRequest instead of RecognitionAudio.
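A rough sketch of that idea, assuming the same older google-cloud-speech client style used in the question (the source URL, ffmpeg flags and chunk size are placeholders you would adjust):

import subprocess
from google.cloud import speech
from google.cloud.speech import enums, types

client = speech.SpeechClient()

# Extract raw 16 kHz mono PCM from a (hypothetical) stream URL and pipe it to stdout.
ffmpeg = subprocess.Popen(
    ['ffmpeg', '-i', 'https://example.com/live/stream.m3u8',
     '-f', 's16le', '-ac', '1', '-ar', '16000', '-'],
    stdout=subprocess.PIPE)

streaming_config = types.StreamingRecognitionConfig(
    config=types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code='en-US'))

def requests():
    # Each request carries one small chunk of raw audio read from the ffmpeg pipe.
    while True:
        chunk = ffmpeg.stdout.read(4096)
        if not chunk:
            return
        yield types.StreamingRecognizeRequest(audio_content=chunk)

for response in client.streaming_recognize(streaming_config, requests()):
    for result in response.results:
        if result.is_final:
            print(result.alternatives[0].transcript)

Note that streaming requests have their own length limits, so for long material you would restart the stream periodically.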
I'm working on a video app. We are changing from regular mp4 files to HLS; one of the many reasons we have to make the change is that we have much more control over the bandwidth usage of videos (we load lots of other stuff in our player, so we need to optimize the experience the best we can).
So, AVFoundation introduced in iOS 10 the ability to control the bandwidth using:
AVPlayerItem *playerItem = [AVPlayerItem playerItemWithAsset:self.urlAsset];
playerItem.preferredForwardBufferDuration = 30.0;
playerItem.preferredPeakBitRate = 200000.0; // Remember this line
There's also a configuration introduced in iOS 11 to set the maximum resolution of the item with preferredMaximumResolution, so we're using it, but we still need a solution for iOS 10 devices.
Well, now we have control over preferredPeakBitRate, which is nice, but we have a problem: not all the HLS sources are generated by us. So, let's say we want to cap the resolution at 480p when the device is not connected to a Wi-Fi network; today I have no way to achieve that, because I won't always know how much bandwidth the 480p source of the selected HLS playlist needs.
One thing I was thinking about is reading the information inside the m3u8 file, to at least know which quality variants my player can show and how much bandwidth each one needs.
One way to do this would be to download the m3u8 playlist as plain text, use a regex to read the file, and process that data. Well, I'm trying to avoid that; I think this should be far less difficult (a rough sketch of that fallback is included after the example output below).
I cannot read this information from the tracks, because (a) I can't find the information there, and (b) the tracks are replaced dynamically when the quality changes; yes, one track for every quality level.
So, I don't know how I can get this information. I've searched Google and Stack Overflow and can't find it. Can anyone help me?
Here's an example of what I want to do. I have this example playlist:
#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=314000,RESOLUTION=228x128,CODECS="mp4a.40.2"
test-hls-1-16a709300abeb08713a5cada91ab864e_hls_duplex_192k.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=478000,RESOLUTION=400x224,CODECS="avc1.42001e,mp4a.40.2"
test-hls-1-16a709300abeb08713a5cada91ab864e_hls_duplex_400k.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=691000,RESOLUTION=480x270,CODECS="avc1.42001e,mp4a.40.2"
test-hls-1-16a709300abeb08713a5cada91ab864e_hls_duplex_600k.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1120000,RESOLUTION=640x360,CODECS="avc1.4d001f,mp4a.40.2"
test-hls-1-16a709300abeb08713a5cada91ab864e_hls_duplex_1000k.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1661000,RESOLUTION=960x540,CODECS="avc1.4d001f,mp4a.40.2"
test-hls-1-16a709300abeb08713a5cada91ab864e_hls_duplex_1500k.m3u8
And I just want to have that information available in an array inside my code, something like this:
NSArray<ZZMetadata *> *metadataArray = self.urlAsset.bandwidthMetadata;
NSLog(@"Metadata info: %@", metadataArray);
And print something like this:
<__NSArrayM 0x123456789> (
<ZZMetadata 0x234567890> {
trackId: 1
neededBandwidth: 314000
resolution: 228x128
codecs: ...
...
}
<ZZMetadata 0x345678901> {
trackId: 2
neededBandwidth: 478000
resolution: 400x224
}
...
)
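For completeness, this is roughly what the plain-text-plus-regex fallback I mentioned above would look like (sketched in Python just to show the parsing logic; the playlist URL is a placeholder):

import re
import urllib.request

PLAYLIST_URL = 'https://example.com/master.m3u8'  # placeholder

def parse_master_playlist(text):
    # Return one dict per #EXT-X-STREAM-INF entry with the attributes I care about.
    variants = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if not line.startswith('#EXT-X-STREAM-INF:'):
            continue
        # Attributes are comma-separated KEY=VALUE pairs; values may be quoted.
        attrs = dict(re.findall(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)', line))
        variants.append({
            'bandwidth': int(attrs.get('BANDWIDTH', '0')),
            'resolution': attrs.get('RESOLUTION'),
            'codecs': attrs.get('CODECS', '').strip('"'),
            'uri': lines[i + 1] if i + 1 < len(lines) else None,
        })
    return variants

with urllib.request.urlopen(PLAYLIST_URL) as resp:
    for variant in parse_master_playlist(resp.read().decode('utf-8')):
        print(variant)

What I would really like is the equivalent of this list coming out of AVFoundation itself, so I don't have to fetch and parse the playlist on my own.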
I'm trying to use Dart to get an OGG file to loop using the HTML5 <audio> element. Does anyone have a working example for this? I'm specifically having trouble getting the audio to loop.
I was not able to get a fully controlled loop using the HTML5 AudioElement; sometimes the loop option was simply not working, sometimes there was a gap, sometimes patterns would overlap.
I had better luck using Web Audio, with something like:
// Build a buffer source feeding a gain node into the destination.
source = audioContext.createBufferSource();
source.buffer = buffer;
gainNode = audioContext.createGain();
gainNode.gain.value = 1;
source.connectNode(gainNode);
gainNode.connectNode(audioContext.destination);
// Play it now, in a loop.
source.loop = true;
source.start(audioContext.currentTime);
I was not able to load the source buffer from the HTML audio element, which could have been a solution for the CORS issues I had. The samples were loaded using HTTP requests instead.
I created a DartPad example that demonstrates looping using both the AudioElement's native loop feature and Web Audio:
https://dartpad.dartlang.org/879424bca794c63698b0
I have a question.
Recently I needed to add custom tags to a recorded video (a local video on the device, not a streamed one). The task is to add some event-specific tags to the video, whose positions could be set by pressing forward/backward buttons, as in any player.
It is not important whether the movie file is in mov or mp4 format.
I searched the forum and found several samples showing how to add metadata using AVAssetExportSession, and it worked.
However, when I tried to add metadata using AVAssetWriter, I wasn't able to append the attributes to the video.
What I do not understand is why, after adding an attribute, the returned time and duration properties are always invalid.
For instance, let's say I have a video with a duration of 2 seconds.
I have tried different key spaces. I am not able to write keys from the ID3 space.
Is ID3 used for streamed video? (As far as I understood, ID3 is the metadata format of .mp3.) In any case, I was not able to write it into an MPEG-4 file.
I also used QuickTimeUserData and ISOUserData, but again the results are the same.
Here is an example:
AVMutableMetadataItem *item2 = [AVMutableMetadataItem new];
item2.keySpace = AVMetadataKeySpaceiTunes;
item2.key = AVMetadataiTunesMetadataKeyUserComment;
item2.value = @"One two three";
item2.duration = CMTimeMakeWithSeconds(1, 1);
item2.time = CMTimeMakeWithSeconds(0, 1);
After reading I got the following:
AVMutableMetadataItem: 0xa4301f0, keySpace=itsk, key=\U00a9cmt, commonKey=(null), locale= (null), value=One two three, time={INVALID}, duration={INVALID}, extras={\n dataType = 1;\n}
I would like to use the time & duration properties of the metadata instead of writing custom data and processing it afterwards.
Ideally it would be great to append an array of items with time = t1, duration = d1, ..., (tn, dn).
Does anyone know how to accomplish that?
I ended up with a solution that adds chapters to the video file instead of using metadata.
I looked at the available libraries and took mpv4lib.
The library is currently not compiled for iOS, so I ported the source project into a static library for the iOS platform.
That library allows adding custom "atoms" to an mp4 file, and one of them is a QuickTime text track containing chapters.
I do something similar to what is described in that post.
The library is located here.
Hi,
I am new to BlackBerry.
I am developing an application to get the song name from a live audio stream. I am able to get the MP3 stream bytes from the particular radio server. To get the song name I add the header "Icy-MetaData: 1", so I am getting the header from the stream. To get the MP3 block size I use "icy-metaint". How do I recognize the metadata blocks given this MP3 block size? I am using the following code; can anyone help me get it working? Here b[off+k] is the bytes that come from the server. I am converting the whole stream into a charArray, which is wrong, but how do I recognize the metadata headers according to the MP3 block size?
b[off+k] = buffers[PlayBuf][PlayByte];
String metaSt = httpConn.getHeaderField("icy-metaint");
metaInt = Integer.parseInt(metaSt);
for (int i = 0; i < b[off+k]; i++)
{
    metadataHeader += (new String(b)).toCharArray();
    System.out.println(metadataHeader);
    metadataLength--;
}
BlackBerry has no native regex functionality; I would recommend grabbing the regexp-me library (http://code.google.com/p/regexp-me/) and compiling it into your code. I've used it before and its regex support is pretty good. I think the regex in the code you posted would work just fine.
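For the framing itself, the rule is: after every icy-metaint bytes of MP3 audio the server inserts one length byte (multiply it by 16), followed by that many bytes of metadata text such as StreamTitle='...';, padded with zero bytes. A rough sketch of that read loop (shown in Python only to illustrate the byte layout; the stream object is whatever you read the HTTP body from):

import re

def read_exact(stream, n):
    # Read exactly n bytes from a file-like stream.
    data = b''
    while len(data) < n:
        chunk = stream.read(n - len(data))
        if not chunk:
            raise EOFError('stream closed')
        data += chunk
    return data

def stream_titles(stream, meta_int):
    # Yield song titles from an ICY stream requested with the Icy-MetaData: 1 header.
    while True:
        read_exact(stream, meta_int)                # one block of MP3 audio, skip it
        length = read_exact(stream, 1)[0] * 16      # single length byte, in units of 16
        if length == 0:
            continue                                # no metadata in this block
        meta = read_exact(stream, length).rstrip(b'\0').decode('latin-1')
        match = re.search(r"StreamTitle='(.*?)';", meta)
        if match:
            yield match.group(1)

The same byte counting can be done in the BlackBerry Java code; the regex at the end is where a library like regexp-me would come in.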