Google Cloud transcription API - machine-learning

I would like to calculate the time duration for every speaker in a two-way conversation call, with the speaker tag, transcription, timestamps for the speaker's duration, and the confidence of the transcription.
For example: I have an MP3 file of a customer care call with 2 speakers. I would like to know the duration of each speaker's speech along with the speaker tag, transcription, and confidence of the transcription.
I am facing issues with the end time and the confidence of the transcription: the confidence comes back as 0, and the end time does not match the actual end time.
audio link: https://drive.google.com/file/d/1OhwQ-xI7Rd-iKNj_dKP2unNxQzMIYlNW/view?usp=sharing
#!pip install --upgrade google-cloud-speech
from google.cloud import speech_v1p1beta1 as speech
import datetime

tag = 1
speaker = ""
transcript = ''

client = speech.SpeechClient.from_service_account_file('#cloud_credentials')
audio = speech.types.RecognitionAudio(uri=gs_uri)
config = speech.types.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US',
    enable_speaker_diarization=True,
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
    diarization_speaker_count=2,
    use_enhanced=True,
    model='phone_call',
    profanity_filter=False,
    enable_word_confidence=True)

print('Waiting for operation to complete…')
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=100000)

with open('output_file.txt', "w") as text_file:
    for result in response.results:
        alternative = result.alternatives[0]
        confidence = result.alternatives[0].confidence
        current_speaker_tag = -1
        transcript = ""
        time = 0
        for word in alternative.words:
            if word.speaker_tag != current_speaker_tag:
                if transcript != "":
                    print(u"Speaker {} - {} - {} - {}".format(current_speaker_tag, str(datetime.timedelta(seconds=time)), transcript, confidence), file=text_file)
                transcript = ""
                current_speaker_tag = word.speaker_tag
                time = word.start_time.seconds
            transcript = transcript + " " + word.word
        if transcript != "":
            print(u"Speaker {} - {} - {} - {}".format(current_speaker_tag, str(datetime.timedelta(seconds=time)), transcript, confidence), file=text_file)

print(u"Speech to text operation is completed, output file is created: {}".format('output_file.txt'))

Your code and the screenshot in the question differ from each other. However, from the screenshot it is clear that you are separating individual speakers' speech using the Speech-to-Text speaker diarization method.
Here you can't calculate a different confidence for each individual speaker, because the response contains a confidence value per transcript and per individual word. A single transcript may or may not contain words from multiple speakers, depending on the audio.
Also, as per the documentation, the response contains all the words with their speaker_tag in the last result. From the docs:
The transcript within each result is separate and sequential per result. However, the words list within an alternative includes all the words from all the results thus far. Thus, to get all the words with speaker tags, you only have to take the words list from the last result.
For that last result, the confidence is 0. You can write the whole response to the console or to a file and inspect it yourself.
# Detects speech in the audio file
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=10000)

# check the whole response
with open('output_file.txt', "w") as text_file:
    print(response, file=text_file)
Or you can also print each individual transcript and its confidence for better understanding, e.g.:
# confidence for each transcript
for result in response.results:
    alternative = result.alternatives[0]
    print("Transcript: {}".format(alternative.transcript))
    print("Confidence: {}".format(alternative.confidence))
For your duration issue with each speaker: you are calculating the start time and end time for each word, not for each individual speaker.
The idea should be something like this:
Take the speaker's first word's start time as the duration start time.
Always set every word's end time as the duration end time, because we don't know whether the next word belongs to a different speaker or not.
Look out for a speaker change; if the speaker is the same, just append the word to the modified transcript, otherwise do the same and also reset the start time for the new speaker.
Eg:
tag = 1
speaker = ""
transcript = ''
start_time = ""
end_time = ""

# the words list of the last result contains all words with speaker tags
words_info = response.results[-1].alternatives[0].words

for word_info in words_info:
    end_time = word_info.end_time.seconds  # tracking the end time of speech
    if start_time == '':
        start_time = word_info.start_time.seconds  # setting the value only for the first word
    if word_info.speaker_tag == tag:
        speaker = speaker + " " + word_info.word
    else:
        transcript += "speaker {}: {}-{} - {}".format(tag, str(datetime.timedelta(seconds=start_time)), str(datetime.timedelta(seconds=end_time)), speaker) + '\n'
        tag = word_info.speaker_tag
        speaker = "" + word_info.word
        start_time = word_info.start_time.seconds  # resetting the start time as we found a new speaker

transcript += "speaker {}: {}-{} - {}".format(tag, str(datetime.timedelta(seconds=start_time)), str(datetime.timedelta(seconds=end_time)), speaker) + '\n'
I have removed the confidence part from the modified transcript because it will always be 0. Also keep in mind that speaker diarization is still in beta, so you might not get exactly the output you want.


How do I read midi / musicxml files in music21 for solo piano where there can be multiple notes in a voice simultaneously?

I have written a Python script to process MIDI files with music21 and write out a MIDI file again. This works if the solo piano piece is "simple", in the sense that there are not multiple pitches/notes played simultaneously in a voice.
https://github.com/githubuser1983/algorithmic_python_music/blob/main/12RootOf2.py
The relevant part from above is:
def parseMidi(fp, part=0):
    import os
    from music21 import converter
    from music21 import chord
    from statistics import median  # not imported in the snippet above; the standard-library median is assumed here
    print(fp)
    score = converter.parse(fp, quantizePost=True)
    print(list(score.elements[0].notesAndRests))
    #print([e.partAbbreviation for e in score.elements][0])
    durs = []
    ll0 = []
    vols = []
    isPauses = []
    for p in score.elements[part].notesAndRests:
        #print(p)
        if type(p) == chord.Chord:
            pitches = median([e.pitch.midi - 21 for e in p])  # todo: think about chords
            vol = median([e.volume.velocity for e in p])
            dur = float(p.duration.quarterLength)
            #print(pitches)
            ll0.append(pitches)
            isPause = False
        elif (p.name == "rest"):
            pitches = 89
            vol = 1
            dur = float(p.duration.quarterLength)
            ll0.append(pitches)
            isPause = True
        else:
            pitches = p.pitch.midi - 21
            vol = p.volume.velocity
            dur = float(p.duration.quarterLength)
            ll0.append(pitches)
            isPause = False
        durs.append(dur / (12 * 4.0))
        vols.append(vol * 1.0 / 127.0)
        isPauses.append(isPause)
        #print(p.name, p.octave, p.duration.quarterLength)
    #print(dir(score))
    #print(ll0)
    #print(durs)
    return ll0, durs, vols, isPauses
Another option would be to read MusicXML instead of MIDI. What I need for the algorithm to work is a list of note(s) = (pitch, duration, volume, isPause) for each voice.
Thanks for your help.
Currently, in music21, stream.Voice objects are more of a display concept than a logical concept. Voices and Chords are both simultaneities, and that's all that a MIDI file captures. (In fact, there are pending changes in version 7, to be released this week, that make fewer voices and more chords from MIDI files, in addition to making measures. If there are small overlaps from reverb or from a recorded performance you may get "voices" that an engraver would never print in sheet music.)
In your case, I would probably just take a .flat of the Part object to get rid of Voices (and eventually Measures in v.7), and then run chordify() if you want to ensure there are no overlaps. Otherwise, if you don't want chords at all, you can still take the output of chordify() and find the root of each chord. There are several possibilities; which one fits best depends on what your sources look like.
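A rough sketch of that approach, assuming fp points at one of your MIDI files and that you want the first part; the 89 rest marker and the "- 21" pitch offset just mirror the conventions in your parseMidi(), and velocity can be None after chordify(), so a fallback value is used:
from music21 import converter, chord, note

score = converter.parse(fp)          # fp: path to your MIDI file (assumed)
flat_part = score.parts[0].flat      # .flat strips Voice containers (.flatten() in newer versions)
merged = flat_part.chordify()        # optional: collapse overlapping notes into Chords

events = []                          # (pitch, duration, volume, isPause) per event
for p in merged.notesAndRests:
    dur = float(p.duration.quarterLength)
    if isinstance(p, chord.Chord):
        # take the chord's root as a single representative pitch
        events.append((p.root().midi - 21, dur, p.volume.velocity or 64, False))
    elif isinstance(p, note.Rest):
        events.append((89, dur, 1, True))
    else:
        events.append((p.pitch.midi - 21, dur, p.volume.velocity or 64, False))
Here p.root() reduces each Chord to a single pitch, playing the same role as the median in your parseMidi(); swap in whatever reduction suits your algorithm.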

AVAudioPlayerNode - Mix between schedule buffers and segments

I'm writing an application where I should play parts of audio files. Each audio file contains audio data for a separate track.
These parts are sections with a begin time and an end time, and I'm trying to play those parts in the order I choose.
So for example, imagine I have 4 sections :
A - B - C - D
and I activate B and D, I want to play B, then D, then B again, then D, etc.
To make smooth "jumps" in playback, I think it's important to fade in/out the start/end section buffers.
So, I have a basic AVAudioEngine setup, with AVAudioPlayerNode, and a mixer.
For each audio section, I cache some information :
a buffer for the first samples in the section (which I fade in manually)
a tuple for the AVAudioFramePosition, and AVAudioFrameCount of a middle segment
a buffer for the end samples in the audio section (which I fade out manually)
Now, when I schedule a section for playing, I tell the AVAudioPlayerNode to:
schedule the start buffer (scheduleBuffer(_:completionHandler:) no option)
schedule the middle segment (scheduleSegment(_:startingFrame:frameCount:at:completionHandler:))
finally schedule the end buffer (scheduleBuffer(_:completionHandler:) no option)
all at "time" nil.
The problem is that I can hear clicks and glitchy sounds at audio section boundaries, and I can't see what I'm doing wrong.
My first suspect was the fades I apply manually (basically multiplying sample values by a volume factor), but I get the same result without them.
I thought I might not be scheduling in time, but scheduling sections in advance, for example A - B - C beforehand, gives the same result.
I then tried different frame position computations, with audio format settings, same result.
So I'm out of ideas here, and perhaps I didn't get the schedule mechanism right.
Can anyone confirm that I can mix scheduling buffers and segments in AVAudioPlayerNode? Or should I schedule only buffers or only segments?
I can confirm that scheduling only segments works, playback is perfectly fine.
A little context on how I cache information for audio sections..
In the code below, file is of type AVAudioFile loaded on disk from a URL, begin and end are TimeInterval values, and represent the start/end of my audio section.
let format = file.processingFormat
let startBufferFrameCount: AVAudioFrameCount = 4096
let endBufferFrameCount: AVAudioFrameCount = 4096
let audioSectionStartFrame = framePosition(at: begin, format: format)
let audioSectionEndFrame = framePosition(at: end, format: format)
let segmentStartFrame = audioSectionStartFrame + AVAudioFramePosition(startBufferFrameCount)
let segmentEndFrame = audioSectionEndFrame - AVAudioFramePosition(endBufferFrameCount)
startBuffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: startBufferFrameCount)
endBuffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: endBufferFrameCount)
file.framePosition = audioSectionStartFrame
try file.read(into: startBuffer)
file.framePosition = segmentEndFrame
try file.read(into: endBuffer)
middleSegment = (segmentStartFrame, AVAudioFrameCount(segmentEndFrame - segmentStartFrame))
frameCount = AVAudioFrameCount(audioSectionEndFrame - audioSectionStartFrame)
Also, the framePosition(at:format:) multiplies the TimeInterval value by the sample rate of the AVAudioFormat passed in.
I cache this information for every audio section, but I hear clicks at section boundaries, no matter whether I schedule them in advance or not.
I also tried not mixing buffers and segments when scheduling, but it doesn't change anything, so I'm starting to think my frame computations are wrong.

Twilio Recording returning negative duration value "-1"

I am facing an issue regarding the recording duration just after the call has ended.
Once the call is completed we look up the recording and read its duration, but r.duration in the snippet below comes back as a negative value, i.e. "-1", instead of the duration in seconds.
client = Twilio::REST::Client.new(TWILIO["ACCOUNT_SID"], TWILIO["ACCOUNT_TOKEN"])
call = client.account.calls.get(call_sid)
if call.present?
  call.recordings.list.each do |r|
    consultation_duration = (r.duration.to_i / 60).to_f
    # downloading the recording logic
    r.delete
  end
end
Can anyone please let me know the reason why Twilio is returning a negative duration?
Twilio developer evangelist here.
I think it's because you are dividing that number by 60 when it's already in seconds. You could change your code to the following and it would give you the duration in seconds anyway.
client = Twilio::REST::Client.new(TWILIO["ACCOUNT_SID"], TWILIO["ACCOUNT_TOKEN"])
call = client.account.calls.get(call_sid)
if call.present?
  call.recordings.list.each do |r|
    consultation_duration = r.duration
    # downloading the recording logic
    r.delete
  end
end
To explain a bit: if the call lasted 30 seconds, you'd be dividing it by 60, which, after the integer conversion, is probably what gives you the behavior you're seeing.
If you wanted to convert to something like HH:MM:SS for example, you could change your code to do the following:
t = 33 # seconds
Time.at(t).utc.strftime("%H:%M:%S")
And this would give you a nicely formatted time like this:
00:00:33
Hope this helps you
Twilio returns a negative duration for recordings that are not finished yet. It means the call is still in progress and the final duration has not been determined yet.

Determine consecutive video clips

I have a long video stream, but unfortunately it's in the form of 1000 randomly-named 15-second clips. I'd like to reconstruct the original video based on some measure of "similarity" between two such 15 s clips, something answering the question "does the activity in clip 2 seem like an extension of clip 1?". There are small gaps between clips, a few hundred milliseconds or so each. I can also manually fix up the results if they're sufficiently good, so results needn't be perfect.
A very simplistic approach can be:
(a) Create an automated process to extract the first and last frame of each video clip in a known image format (e.g. JPG) and name them according to the video clip names (a possible ffmpeg-based sketch is given after the algorithm below), e.g. if you have the video clips:
clipA.avi, clipB.avi, clipC.avi
you may create the following frame-images:
clipA_first.jpg, clipA_last.jpg, clipB_first.jpg, clipB_last.jpg, clipC_first.jpg, clipC_last.jpg
(b) The sorting "algorithm":
1. Create a 'Clips' list of Clip-Records, each containing:
   (a) clip-name (string)
   (b) prev-clip-name (string)
   (c) prev-clip-diff (float)
   (d) next-clip-name (string)
   (e) next-clip-diff (float)
2. Apply the following processing:
   for Each ClipX having ClipX.next-clip-name == "" do:
   {
       ClipX.next-clip-diff = <a big enough number>;
       for Each ClipY having ClipY.prev-clip-name == "" do:
       {
           float ImageDif = ImageDif(ClipX.last-frame.jpg, ClipY.first_frame.jpg);
           if (ImageDif < ClipX.next-clip-diff)
           {
               ClipX.next-clip-name = ClipY.clip-name;
               ClipX.next-clip-diff = ImageDif;
           }
       }
       Clips[ClipX.next-clip-name].prev-clip-name = ClipX.clip-name;
       Clips[ClipX.next-clip-name].prev-clip-diff = ClipX.next-clip-diff;
   }
3. Scan the Clips list to find the record(s) with no <prev-clip-name>, or
   (if all records have a <prev-clip-name>) find the record with the max <prev-clip-diff>.
   These are good candidates to be the first clip in a sequence.
4. Begin from the clip(s) found in step (3) and rename the clip files by adding
   a 5-digit number (00001, 00002, etc.) at the beginning of each filename, going
   from aClip to aClip.next-clip-name and removing each clip from the list.
5. Repeat steps 3 and 4 until there are no clips left in the list.
6. Voila! You have your sorted clips list in the form of sorted video filenames!
...or you may end up with more than one sorted list (if the time gaps between your
video clips are large enough).
Very simplistic... but I think it can be effective...
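For step (a), one possible way to extract the frames is to shell out to ffmpeg from a small Python script. This is only a sketch, assuming ffmpeg is on the PATH and the clips are .avi files as in the example names above; extract_frames is a hypothetical helper:
import glob
import os
import subprocess

def extract_frames(clip_path):
    base, _ = os.path.splitext(clip_path)
    # first frame of the clip
    subprocess.run(["ffmpeg", "-y", "-i", clip_path, "-vframes", "1",
                    "-q:v", "2", base + "_first.jpg"], check=True)
    # last frame: seek ~1 s before the end (-sseof) and keep overwriting
    # the output image (-update 1) so the final written frame remains
    subprocess.run(["ffmpeg", "-y", "-sseof", "-1", "-i", clip_path,
                    "-update", "1", "-q:v", "2", base + "_last.jpg"], check=True)

for clip in glob.glob("*.avi"):
    extract_frames(clip)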
PS1: Regarding the ImageDif() function: you can create a new DifImage, which is the difference of the images ClipX.last-frame.jpg and ClipY.first_frame.jpg, and then sum all pixels of DifImage into a single floating point ImageDif value. You can also optimize the process to abort the difference (or the summing) if the running sum gets bigger than some limit: you are actually interested only in small differences. An ImageDif value larger than an (experimental) limit means that the 2 images differ so much that the 2 clips cannot be next to each other.
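A minimal ImageDif() sketch along those lines, assuming Pillow and NumPy are available and the frames were extracted as above; image_dif and the abort_limit value are illustrative, not tuned:
import numpy as np
from PIL import Image

def image_dif(last_frame_path, first_frame_path, abort_limit=None):
    # sum of absolute pixel differences between two (grayscale) frames
    a = np.asarray(Image.open(last_frame_path).convert("L"), dtype=np.int32)
    b = np.asarray(Image.open(first_frame_path).convert("L"), dtype=np.int32)
    diff = np.abs(a - b)
    if abort_limit is None:
        return float(diff.sum())
    total = 0.0
    for row in diff:                # abort early: we only care about small differences
        total += float(row.sum())
        if total > abort_limit:
            return float("inf")     # too different to be consecutive clips
    return total

# e.g. compare clipA's last frame with clipB's first frame:
# dif = image_dif("clipA_last.jpg", "clipB_first.jpg", abort_limit=5_000_000)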
PS2: The sorting algorithm's order of complexity should be approximately O(n*log(n)); therefore, for 1000 video clips it will perform about 3000 image comparisons (or a little more if you optimize the algorithm and allow it to not find a match for some clips).

Jump to a specific time in a file?

I'm trying to make a VLC script that checks whether the "random" button is on, and if so, when it jumps to a random file, starts at a random time instead of time=0.
So far it looks to me like it should be a playlist script, and I can get the duration from the playlist object, but this documentation page doesn't show how to jump to a specific time from within the Lua script.
How can that be done in Lua?
Actually, the documentation does say you can do it...though not in so many words. Here's what it says about the interface for playlist parsers:
VLC Lua playlist modules should define two functions:
* probe(): returns true if we want to handle the playlist in this script
* parse(): read the incoming data and return playlist item(s)
Playlist items use the same format as that expected in the
playlist.add() function (see general lua/README.txt)
If you follow through to the description of playlist.add(), it says the items have a big list of fields you can provide. There are plenty of choices (.name, .title, .artist, etc.), but the only required one seems to be .path, which is "the item's full path / URL".
There's no explicit mention of where to seek, but one of the parameters you can choose to provide is .options, said to be "a list of VLC options". It gives fullscreen as an example. If a parallel to --fullscreen works, can other command-line options like --start-time and --stop-time work as well?
On my system they do, and here's the script!
-- randomseek.lua
--
-- A compiled version of this file (.luac) should be put into the proper VLC
-- playlist parsers directory for your system type. See:
--
-- http://wiki.videolan.org/Documentation:Play_HowTo/Building_Lua_Playlist_Scripts
--
-- The file format is extremely simple and is merely alternating lines of
-- filenames and durations, such as if you had a file "example.randomseek"
-- it might contain:
--
-- foo.mp4
-- 3:04
-- bar.mov
-- 10:20
--
-- It simply will seek to a random location in the file and play a random
-- amount of the remaining time in the clip.

function probe()
    -- seed the random number since other VLC lua plugins don't seem to
    math.randomseed(os.time())
    -- tell VLC we will handle anything ending in ".randomseek"
    return string.match(vlc.path, ".randomseek$")
end

function parse()
    -- VLC expects us to return a list of items, each item itself a list
    -- of properties
    playlist = {}
    -- I'll assume a well formed input file but obviously you should do
    -- error checking if writing something real
    while true do
        playlist_item = {}
        line = vlc.readline()
        if line == nil then
            break --error handling goes here
        end
        playlist_item.path = line
        line = vlc.readline()
        if line == nil then
            break --error handling goes here
        end
        for _min, _sec in string.gmatch( line, "(%d*):(%d*)" ) do
            duration = 60 * _min + _sec
        end
        -- math.random with integer argument returns an integer between
        -- one and the number passed in inclusive, VLC uses zero based times
        start_time = math.random(duration) - 1
        stop_time = math.random(start_time, duration - 1)
        -- give the viewer a hint of how long the clip will take
        playlist_item.duration = stop_time - start_time
        -- a playlist item has another list inside of it of options
        playlist_item.options = {}
        table.insert(playlist_item.options, "start-time="..tostring(start_time))
        table.insert(playlist_item.options, "stop-time="..tostring(stop_time))
        table.insert(playlist_item.options, "fullscreen")
        -- add the item to the playlist
        table.insert( playlist, playlist_item )
    end
    return playlist
end
Just use this:
vlc.var.set(input, "time", time)
There is a seek method in common.lua.
Usage examples:
require 'common'
common.seek(123) -- seeks to 02m03s
common.seek("50%") -- seeks to middle of video
