timestamps seem to allow 0 seconds duration for some words in results, is this a bug? - google-cloud-speech

When using the Google Cloud Speech API, the new word-level timestamp (timecode) feature seems to allow a duration of 0 seconds for some words in the results. Here is an example:
...
{ startTime: '48.800s', endTime: '48.800s', word: 'a' },
{ startTime: '48.800s', endTime: '49.200s', word: 'kindly' },
...
Is this a bug?
To test, I used a clip from the audio archive "Arthur the Rat", "USA - General mid-western speaker (Michigan)".
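One way to see how often this happens is to scan the word list for entries whose start and end offsets are equal. The following is only a minimal sketch, assuming the results have already been loaded into Python dicts with the same startTime/endTime/word keys as in the excerpt above; to_seconds and zero_duration_words are hypothetical helper names, not part of the API.

```python
def to_seconds(offset):
    """Convert an offset string such as '48.800s' to a float number of seconds."""
    return float(offset.rstrip("s"))


def zero_duration_words(words):
    """Return the words whose start and end offsets are identical."""
    return [
        w["word"]
        for w in words
        if to_seconds(w["startTime"]) == to_seconds(w["endTime"])
    ]


# The two entries quoted in the question
words = [
    {"startTime": "48.800s", "endTime": "48.800s", "word": "a"},
    {"startTime": "48.800s", "endTime": "49.200s", "word": "kindly"},
]
print(zero_duration_words(words))
```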

You can get better than one-second precision using the returned timestamp.
Take the start time from the structure containing the word and output it like this:
start_time.seconds + start_time.nanos * 1e-9

David Anderson's answer is correct; I just thought I'd elaborate on it, as I initially thought the response had only one-second precision rather than the 100 ms precision the docs describe.
As of July 2018, sending a request to the Google Cloud Speech API with word time offsets enabled returns a response object where each word result in response.results has the structure:
start_time {
  seconds: 24
  nanos: 100000000
}
end_time {
  seconds: 24
  nanos: 700000000
}
word: "of"
The nanos field lets you get the start and end times to 100 ms precision. You can obtain them like so:
print(start_time.seconds + start_time.nanos * 1e-9)
print(end_time.seconds + end_time.nanos * 1e-9)
==== Output ====
24.1
24.7
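If you also want the duration of each word, it may be safer to do the subtraction on integer nanoseconds and convert to a float only at the end. This is a sketch under assumptions: Offset is a hypothetical stand-in for the start_time/end_time structure shown above (anything with seconds and nanos attributes would do), and to_seconds and word_duration are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Offset:
    """Stand-in for the start_time/end_time structure (seconds plus nanos)."""
    seconds: int
    nanos: int


def to_seconds(offset):
    """Combine seconds and nanos as integer nanoseconds, then convert to float."""
    return (offset.seconds * 10**9 + offset.nanos) / 10**9


def word_duration(start, end):
    """Duration in seconds, subtracting in integer nanoseconds first."""
    total_nanos = (end.seconds - start.seconds) * 10**9 + (end.nanos - start.nanos)
    return total_nanos / 10**9


# Values taken from the response structure shown above
start_time = Offset(seconds=24, nanos=100000000)
end_time = Offset(seconds=24, nanos=700000000)
print(to_seconds(start_time), to_seconds(end_time))
print(word_duration(start_time, end_time))
```

A word with identical start and end offsets then comes out as a duration of exactly 0.0, which makes such entries easy to filter.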

Parse time string to hours, minutes and seconds in Lua

I am currently working on a plugin for grandMA2 lighting control using Lua. I need the current time. The only way to get the current time is the following function:
gma.show.getvar('TIME')
which always returns the current system time, which I then store in a variable. An example return value is "12h54m47.517s".
How can I separate the hours, minutes and seconds into 3 variables?
If os.date is available (and matches gma.show.getvar('TIME')), this is trivial:
If format starts with '!', then the date is formatted in Coordinated Universal Time. After this optional character, if format is the string "*t", then date returns a table with the following fields: year, month (1–12), day (1–31), hour (0–23), min (0–59), sec (0–61, due to leap seconds), wday (weekday, 1–7, Sunday is 1), yday (day of the year, 1–366), and isdst (daylight saving flag, a boolean). This last field may be absent if the information is not available.
local time = os.date('*t')
local hour, min, sec = time.hour, time.min, time.sec
This does not provide you with a sub-second precision though.
Otherwise, parsing the time string is a typical task for string.match:
local hour, min, sec = gma.show.getvar('TIME'):match('^(%d+)h(%d+)m(%d*%.?%d*)s$')
-- This is usually not needed as Lua will just coerce strings to numbers
-- as soon as you start doing arithmetic on them;
-- it still is good practice to convert the variables to the proper type though
-- (and starts being relevant when you compare them, use them as table keys or call strict functions that check their argument types on them)
hour, min, sec = tonumber(hour), tonumber(min), tonumber(sec)
Pattern explanation:
^ and $ pattern anchors: Match the full string (and not just part of it), making the match fail if the string does not have the right format.
(%d+)h: Capture hours: One or more digits followed by a literal h
(%d+)m: Capture minutes: One or more digits followed by a literal m
(%d*%.?%d*)s: Capture seconds: Zero or more digits followed by an optional dot followed by again zero or more digits, finally ending with a literal s. I do not know the specifics of the format and whether something like .1s, 1.s or 1s is occasionally emitted, but Lua's tonumber supports all of these so there should be no issue. Note that this is slightly overly permissive: It will also match . (just a dot) and an s without any leading digits. You might want (%d+%.?%d+)s instead to force digits before and after the dot, but be aware that this requires at least two digits in total, so a value like 5s would no longer match.
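For comparison, here is the same parse sketched in Python, whose regular expressions allow an optional group and so can express "digits, optionally followed by a dot and more digits" directly. This is only an illustration of the pattern logic, not something that runs inside the grandMA2 plugin; parse_time and TIME_RE are hypothetical names.

```python
import re

# Stricter than the Lua pattern above: digits are required, the fraction is optional.
TIME_RE = re.compile(r"^(\d+)h(\d+)m(\d+(?:\.\d+)?)s$")


def parse_time(value):
    """Split a string like '12h54m47.517s' into (hours, minutes, seconds)."""
    match = TIME_RE.match(value)
    if match is None:
        raise ValueError(f"unexpected time format: {value!r}")
    hours, minutes, seconds = match.groups()
    return int(hours), int(minutes), float(seconds)


print(parse_time("12h54m47.517s"))
```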
Let's do it with the string method gsub():
local ts = gma.show.getvar('TIME')
local hours = ts:gsub('h.*', '')
local mins = ts:gsub('.*%f[^h]', ''):gsub('%f[m].*', '')
local secs = ts:gsub('.*%f[^m]', ''):gsub('%f[s].*', '')
To build a time string, I suggest the string method format():
-- secs as float
timestring = ('[%s:%s:%.3f]'):format(hours, mins, secs)
-- secs not as float
timestring = ('[%s:%s:%.f]'):format(hours, mins, secs)

Google cloud transcription API

I would like to calculate how long each speaker talks in a two-way call, together with the speaker tag, the transcription, the timestamps of each speaker's turn and the confidence of the transcription.
For example: I have an mp3 file of a customer support call with 2 speakers. I would like to know the speaking duration of each speaker, along with the speaker tag, the transcription and the confidence of the transcription.
I am facing issues with the end time and the confidence of the transcription. I'm getting a confidence of 0 for the transcription, and the end time does not match the actual end time.
audio link: https://drive.google.com/file/d/1OhwQ-xI7Rd-iKNj_dKP2unNxQzMIYlNW/view?usp=sharing
#!pip install --upgrade google-cloud-speech
from google.cloud import speech_v1p1beta1 as speech
import datetime
tag=1
speaker=""
transcript = ''
client = speech.SpeechClient.from_service_account_file('#cloud_credentials')
audio = speech.types.RecognitionAudio(uri=gs_uri)
config = speech.types.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US',
    enable_speaker_diarization=True,
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
    diarization_speaker_count=2,
    use_enhanced=True,
    model='phone_call',
    profanity_filter=False,
    enable_word_confidence=True)
print('Waiting for operation to complete…')
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=100000)
with open('output_file.txt', "w") as text_file:
    for result in response.results:
        alternative = result.alternatives[0]
        confidence = result.alternatives[0].confidence
        current_speaker_tag=-1
        transcript = ""
        time = 0
        for word in alternative.words:
            if word.speaker_tag != current_speaker_tag:
                if (transcript != ""):
                    print(u"Speaker {} - {} - {} - {}".format(current_speaker_tag, str(datetime.timedelta(seconds=time)), transcript, confidence), file=text_file)
                transcript = ""
                current_speaker_tag = word.speaker_tag
                time = word.start_time.seconds
            transcript = transcript + " " + word.word
        if transcript != "":
            print(u"Speaker {} - {} - {} - {}".format(current_speaker_tag, str(datetime.timedelta(seconds=time)), transcript, confidence), file=text_file)
    print(u"Speech to text operation is completed, output file is created: {}".format('output_file.txt'))
The code and the screenshot in your question differ from each other. However, from the screenshot it is clear that you are separating each speaker's speech using Speech-to-Text speaker diarization.
You can't calculate a separate confidence for each speaker, because the response contains a confidence value only for each transcript and for individual words. A single transcript may contain words from more than one speaker, depending on the audio.
Also, as per the documentation, the last result in the list contains all the words with their speaker_tag. From the docs:
The transcript within each result is separate and sequential per result. However, the words list within an alternative includes all the words
from all the results thus far. Thus, to get all the words with speaker
tags, you only have to take the words list from the last result.
For the last result in the list, the confidence is 0. You can write the response to the console or to a file and inspect it yourself:
# Detects speech in the audio file
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=10000)
# check the whole response
with open('output_file.txt', "w") as text_file:
    print(response,file=text_file)
Or you can print each transcript with its confidence for a better understanding, e.g.:
#confidence for each transcript
for result in response.results:
    alternative = result.alternatives[0]
    print("Transcript: {}".format(alternative.transcript))
    print("Confidence: {}".format(alternative.confidence))
As for the duration issue: you are working with the start and end time of each word, not of each speaker's turn.
The idea should be something like this:
Take the start time of the speaker's first word as the start of the duration.
Keep updating the end of the duration with each word's end time, because we don't know whether the next word belongs to a different speaker.
Watch for a speaker change: if the speaker is the same, just append the word to the current segment; otherwise write out the finished segment and reset the start time for the new speaker.
E.g.:
import datetime
# all the diarized words are in the words list of the last result
words_info = response.results[-1].alternatives[0].words
tag = None
speaker = ""
transcript = ''
start_time = 0
end_time = 0
for word_info in words_info:
    if word_info.speaker_tag != tag:
        if tag is not None:
            # speaker changed: write out the previous speaker's segment
            transcript += "speaker {}: {}-{} - {}".format(tag,str(datetime.timedelta(seconds=start_time)),str(datetime.timedelta(seconds=end_time)),speaker) + '\n'
        tag = word_info.speaker_tag
        speaker = word_info.word
        start_time = word_info.start_time.seconds #resetting the start time as we found a new speaker
    else:
        speaker = speaker + " " + word_info.word
    end_time = word_info.end_time.seconds #tracking the end time of the current speaker's speech
if tag is not None:
    # write out the last speaker's segment
    transcript += "speaker {}: {}-{} - {}".format(tag,str(datetime.timedelta(seconds=start_time)),str(datetime.timedelta(seconds=end_time)),speaker) + '\n'
I have removed the confidence part from the modified transcript because it will always be 0. Also keep in mind that speaker diarization is still in beta, so you might not get exactly the output you want.
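If what you ultimately need is the total speaking time per speaker rather than a line per turn, the same segment logic can feed a running total. The sketch below is an assumption-laden illustration: it works on plain (speaker_tag, start_seconds, end_seconds) tuples instead of the API's word objects, speaking_time is a hypothetical helper name, and the sample data is made up rather than taken from your audio.

```python
from collections import defaultdict


def speaking_time(words):
    """Sum the seconds spoken per speaker tag.

    `words` is a list of (speaker_tag, start_seconds, end_seconds) tuples in
    spoken order; consecutive words by the same speaker form one segment.
    """
    totals = defaultdict(float)
    current_tag = None
    segment_start = segment_end = 0.0
    for tag, start, end in words:
        if tag != current_tag:
            if current_tag is not None:
                # Speaker changed: close the previous segment
                totals[current_tag] += segment_end - segment_start
            current_tag, segment_start = tag, start
        segment_end = end
    if current_tag is not None:
        # Close the final segment
        totals[current_tag] += segment_end - segment_start
    return dict(totals)


# Made-up sample data for illustration only
words = [(1, 0.0, 0.5), (1, 0.5, 1.5), (2, 2.0, 3.0), (1, 3.5, 4.0)]
print(speaking_time(words))
```

Note that this counts pauses between consecutive words of the same speaker as speaking time; summing per-word durations instead would exclude them.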

Trying to understand “PT1H” in a date-time string

I'm using the Weather API to pull weather information down as a service for my project. I'm trying to understand some timezone offsets that I can't seem to find information on.
The URL I'm using is:
https://api.weather.gov/gridpoints/VEF/154,48
Here are some sample return values:
"temperature": {
    "sourceUnit": "F",
    "uom": "unit:degC",
    "values": [
        {
            "validTime": "2019-05-11T16:00:00+00:00/PT1H",
            "value": 18.333333333333371
        },
        {
            "validTime": "2019-05-12T04:00:00+00:00/PT2H",
            "value": 16.1111111111112
        },
        {
            "validTime": "2019-05-12T21:00:00+00:00/PT4H",
            "value": 26.666666666666742
        },
        ...
    ]
}
I understand the PT means Pacific Timezone. But I can't seem to find any information on the next two characters, like 1H, 2H, etc.
If anyone can advise, that would be appreciated. Thanks
PT1H = One hour
I understand the PT means Pacific Timezone.
No, incorrect assumption. Not a time zone.
The PT1H represents a duration, a span of time not tied to the timeline. This format is defined in the ISO 8601 standard.
The P marks the beginning, short for “Period” I imagine, a synonym for duration. The T separates any years-months-days portion from any hours-minutes-seconds portion.
So PT1H means “one hour”. Two and a half hours would be PT2H30M.
Parsing
Your input "2019-05-11T16:00:00+00:00/PT1H" combining a starting moment with a duration is part of the ISO 8601 standard.
Such a combo string can be parsed by the Interval.parse method found in the ThreeTen-Extra library.
Interval interval = Interval.parse( "2019-05-11T16:00:00+00:00/PT1H" ) ;
An Interval represents a pair of moments, a pair of Instant objects. In your case here, the second moment is calculated, by adding the duration to the starting moment. We can interrogate for the pair of moments, represented as Instant objects (always in UTC by definition).
Instant start = interval.getStart() ;
Instant stop = interval.getEnd() ;
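If you are not on the JVM, the same idea can be sketched with Python's standard library. This is a minimal illustration, not a full ISO 8601 implementation: the hypothetical parse_duration helper handles only the hours/minutes/seconds part of a duration (the PT… forms seen in the weather data), and parse_interval assumes the start/duration form with a single slash.

```python
import re
from datetime import datetime, timedelta

# Only the time part of a duration is supported, e.g. PT1H or PT2H30M.
DURATION_RE = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?$")


def parse_duration(text):
    """Convert a duration such as 'PT2H30M' into a timedelta."""
    match = DURATION_RE.match(text)
    if match is None or text == "PT":
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return timedelta(
        hours=int(hours or 0),
        minutes=int(minutes or 0),
        seconds=float(seconds or 0),
    )


def parse_interval(text):
    """Split 'start/duration' and return the (start, end) datetimes."""
    start_text, duration_text = text.split("/")
    start = datetime.fromisoformat(start_text)
    return start, start + parse_duration(duration_text)


start, end = parse_interval("2019-05-11T16:00:00+00:00/PT1H")
print(start.isoformat(), end.isoformat())
```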

Ruby on Rails - YouTube API - How to format duration ISO 8601 (PT45S)

I've searched high and low and can't seem to find a straight answer on this. Basically, I'm calling the YouTube API and getting a JSON document back, then parsing it. Everything else is good, but I don't understand how to parse the 'duration' property to display it as human readable.
The 'duration' field comes over as 'PT1H5M34S' - 1 hour 5 minutes 34 seconds
Or it could be 'PT24S' - 24 seconds
Or 'PT4M3S' - 4 minutes 3 seconds
There has to be a way in Ruby to parse this string and make it human readable so that I can just pass in the duration on the fly in my loop and convert it. Any help or guidance is greatly appreciated. I've tried using Date.parse, Time.parse, Date.strptime, along with many other things... Like just gsub-ing the PT out of the string and displaying it, but that doesn't seem right.
Try arnau's ISO8601 parser (https://github.com/arnau/ISO8601)
Usage:
d = ISO8601::Duration.new('PT1H5M34S')
d.to_seconds # => 3934.0
A simple approach to get the number of seconds for videos less than 24 hours:
dur = "PT34M5S"
pattern = "PT"
pattern += "%HH" if dur.include? "H"
pattern += "%MM" if dur.include? "M"
pattern += "%SS" if dur.include? "S"
DateTime.strptime(dur, pattern).seconds_since_midnight.to_i
You can use Rails ActiveSupport::Duration
parsed_duration = ActiveSupport::Duration.parse(youtube_duration)
Time.at(parsed_duration).utc.strftime('%H:%M:%S')
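If the goal is wording like "1 hour 5 minutes 34 seconds" rather than a clock-style string, the components can be pulled out of the duration directly. The sketch below is in Python purely to illustrate the logic (the same regular expression approach carries over to Ruby); humanize and UNITS are hypothetical names, and only the hours/minutes/seconds part of a duration is handled.

```python
import re

UNITS = {"H": "hour", "M": "minute", "S": "second"}


def humanize(duration):
    """Turn 'PT1H5M34S' into '1 hour 5 minutes 34 seconds' (time part only)."""
    if duration == "PT" or not re.fullmatch(r"PT(\d+H)?(\d+M)?(\d+S)?", duration):
        raise ValueError(f"unsupported duration: {duration!r}")
    parts = []
    for amount, unit in re.findall(r"(\d+)([HMS])", duration):
        n = int(amount)
        # Pluralize unless the amount is exactly 1
        parts.append(f"{n} {UNITS[unit]}{'' if n == 1 else 's'}")
    return " ".join(parts)


for sample in ("PT1H5M34S", "PT24S", "PT4M3S"):
    print(sample, "->", humanize(sample))
```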

Timecodes in Rails - time or numeric values?

I'm working on a project that stores data on audio tracks and requires the use of timecodes for the start and end points of the track on the audio. I also need to calculate and display the duration of the track. Eg. a track starts at 0:01:30 and finishes at 0:04:12. So its duration is a total of 2 mins and 42 secs.
The trick is that everything needs to be displayed and handled as timecodes, so in the above example the duration needs to be displayed as 0:02:42.
So my question is: how would you store the values? The easiest option would be to store the start and end times as Time in the database. It's very easy to calculate the duration, and you can use the Rails time helpers in the forms. The only painful part is turning the duration back into a time value for display (since if I supply just the number of seconds to strptime, it keeps using the current time to fill in the other fields).
The other option that I considered is storing them as numeric values (as the number of seconds). But then I have to write a lot of code to convert them to and from some type of timecode format and I can't use the Rails time helpers.
Is there another idea that I haven't considered? Is there an easy way to calculate and display the duration as a timecode format?
I would store them as seconds or milliseconds. I've been working on a music library manager/audio player in Ruby, and I actually had to write the two methods you would need. It's not that much code:
# Helper method to format a number of milliseconds as a string like
# "1:03:56.555". The only option is :include_milliseconds, true by default. If
# false, milliseconds won't be included in the formatted string.
def format_time(milliseconds, options = {})
  ms = milliseconds % 1000
  seconds = (milliseconds / 1000) % 60
  minutes = (milliseconds / 60000) % 60
  hours = milliseconds / 3600000
  if ms.zero? || options[:include_milliseconds] == false
    ms_string = ""
  else
    ms_string = ".%03d" % [ms]
  end
  if hours > 0
    "%d:%02d:%02d%s" % [hours, minutes, seconds, ms_string]
  else
    "%d:%02d%s" % [minutes, seconds, ms_string]
  end
end
# Helper method to parse a string like "1:03:56.555" and return the number of
# milliseconds that time length represents.
def parse_time(string)
  parts = string.split(":").map(&:to_f)
  parts = [0] + parts if parts.length == 2
  hours, minutes, seconds = parts
  seconds = hours * 3600 + minutes * 60 + seconds
  milliseconds = seconds * 1000
  milliseconds.to_i
end
It's written for milliseconds, and would be a lot simpler if it was changed to work with seconds.
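To show how much simpler the seconds-only variant gets, here is a rough sketch of the two helpers working on whole seconds. It is written in Python just to illustrate the arithmetic; format_timecode and parse_timecode are hypothetical names, and the example values are the 0:01:30 and 0:04:12 timecodes from the question.

```python
def format_timecode(total_seconds):
    """Format whole seconds as H:MM:SS, e.g. 162 -> '0:02:42'."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def parse_timecode(text):
    """Parse 'H:MM:SS' or 'MM:SS' into a whole number of seconds."""
    parts = [int(part) for part in text.split(":")]
    if len(parts) == 2:
        parts = [0] + parts  # no hours given
    if len(parts) != 3:
        raise ValueError(f"unexpected timecode: {text!r}")
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds


start = parse_timecode("0:01:30")
end = parse_timecode("0:04:12")
print(format_timecode(end - start))
```

Storing the start and end points as integers this way keeps the duration a plain subtraction, with formatting applied only at display time.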
