With Google Cloud Speech-to-Text, why do I get different results for the same audio file depending on which bucket I put it in? - google-cloud-speech

I am trying to use Google Cloud Speech-to-text, using the client libraries, from a node.js environment, and I see something I don't understand: I get a different result for the same example audio file, and the same configuration, depending on whether I am using it from the original sample bucket, or from my own bucket.
Here are the requests and responses:
The baseline is Google's own test data file, available here: https://storage.googleapis.com/cloud-samples-tests/speech/brooklyn.flac
Request:
{
"config": {
"encoding": "FLAC",
"languageCode": "en-US",
"sampleRateHertz": 16000,
"enableAutomaticPunctuation": true
},
"audio": {
"uri": "gs://cloud-samples-tests/speech/brooklyn.flac"
}
}
Response:
{
"results": [
{
"alternatives": [
{
"transcript": "How old is the Brooklyn Bridge?",
"confidence": 0.9831430315971375
}
]
}
]
}
So far, so good. But, if I download this audio file, re-upload it to my own bucket, and do the same, then:
Request:
{
"config": {
"encoding": "FLAC",
"languageCode": "en-US",
"sampleRateHertz": 16000,
"enableAutomaticPunctuation": true
},
"audio": {
"uri": "gs://goe-transcript-creation/brooklyn.flac"
}
}
Response:
{
"results": [
{
"alternatives": [
{
"transcript": "how old is",
"confidence": 0.8902621865272522
}
]
}
]
}
As you can see, this is the same request. The re-uploaded audio data is here: https://storage.googleapis.com/goe-transcript-creation/brooklyn.flac
This is the exact same file as in the first example... not a bit of difference.
Still, the results are different; I only get half of the sentence.
What am I missing here? Thanks.
Update 1:
The same thing happens with the CLI tool, too:
$ gcloud ml speech recognize gs://cloud-samples-tests/speech/brooklyn.flac --language-code=en-US
{
"results": [
{
"alternatives": [
{
"confidence": 0.98314303,
"transcript": "how old is the Brooklyn Bridge"
}
]
}
]
}
$ gcloud ml speech recognize gs://goe-transcript-creation/brooklyn.flac --language-code=en-US
ERROR: (gcloud.ml.speech.recognize) INVALID_ARGUMENT: Invalid recognition 'config': bad encoding..
$ gcloud ml speech recognize gs://goe-transcript-creation/brooklyn.flac --language-code=en-US --encoding=FLAC
ERROR: (gcloud.ml.speech.recognize) INVALID_ARGUMENT: Invalid recognition 'config': bad sample rate hertz.
$ gcloud ml speech recognize gs://goe-transcript-creation/brooklyn.flac --language-code=en-US --encoding=FLAC --sample-rate=16000
{
"results": [
{
"alternatives": [
{
"confidence": 0.8902483,
"transcript": "how old is"
}
]
}
]
}
It's also interesting that when pulling the audio from the other bucket, I need to specify encoding and sample rate, otherwise it doesn't work... but it's not necessary when I am using the original test bucket.
Update 2:
If I don't use Google Cloud Storage, but upload the data directly in the speech-to-text request, it works as intended:
$ gcloud ml speech recognize brooklyn.flac --language-code=en-US
{
"results": [
{
"alternatives": [
{
"confidence": 0.98314303,
"transcript": "how old is the Brooklyn Bridge"
}
]
}
]
}
So the problem doesn't seem to be with the recognition itself, but with accessing the audio data. The obvious guess would be that the upload is at fault and the data was somehow corrupted along the way.
We can verify that by pulling the data back from the cloud and comparing it with the original. It doesn't seem to be broken.
So maybe the problem occurs when the Speech-to-Text service accesses the storage service? But why with one bucket only? Or is it some kind of file metadata problem?
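For reference, the integrity check described above (download the object again and compare it with the original) can be sketched in Python using only the standard library. The function names are hypothetical, and the check covers file bytes only: it would not reveal any difference in the object metadata stored alongside the file in the bucket.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """True when both files contain exactly the same bytes."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Calling same_content() on the original brooklyn.flac and on the copy downloaded from the second bucket would confirm (or refute) the "not a bit of difference" claim.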

Thingsboard Upload Converter with multiple timestamps

My device takes measurements more often than it communicates with the MQTT broker, so each message can contain more than one timestamp, like this:
my/device/telemetry 1651396728000:22,13;1651400328000:25,10;...so on
I want to use the built-in Thingsboard MQTT Integration with my custom Upload Converter, but I can't find the proper format for a result object with multiple timestamps in it (the way it was possible in the Gateway Telemetry API).
The output of your data converter should be an array like this:
var result = [
{
"deviceName": "88888888",
"deviceType": "tracker",
"attributes": {
"att1": "val1",
},
"telemetry": {
"ts": 1652738915000,
"values": {
"blah": "blooo",
"External Voltage": 12812
}
}
},
{same},
{similar}
]
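The splitting logic can be sketched in Python for illustration (the converter itself is written in JavaScript, so this only shows the idea). It assumes each chunk has the form ts:v0,v1 where the comma separates two readings; the function name and the reading0/reading1 keys are hypothetical placeholders, and the sketch emits one array entry per timestamp in the shape shown above.

```python
def parse_payload(payload: str, device_name: str, device_type: str) -> list:
    """Split 'ts:v0,v1;ts:v0,v1' into one result entry per timestamp."""
    result = []
    for chunk in payload.split(";"):
        chunk = chunk.strip()
        if not chunk or ":" not in chunk:
            continue  # skip empty or malformed chunks
        ts_text, values_text = chunk.split(":", 1)
        readings = values_text.split(",")
        result.append({
            "deviceName": device_name,
            "deviceType": device_type,
            "telemetry": {
                "ts": int(ts_text),
                # Key names are hypothetical; use the real measurement names.
                "values": {f"reading{i}": float(v) for i, v in enumerate(readings)},
            },
        })
    return result


entries = parse_payload("1651396728000:22,13;1651400328000:25,10", "88888888", "tracker")
```

Here entries holds two objects for the same device, each carrying its own "ts".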

How To Convert "created_timestamp" Value To A Valid Date In Python

I'm currently working on a Twitter bot that automatically replies to messages. I'm doing this with Tweepy (a Python library for the Twitter API).
I need to filter messages by creation time, as I don't want to reply to the same message twice. The problem is that the API endpoint returns created_timestamp as a string representation of a positive integer.
Below is an example of the data returned, as per the docs:
{
"next_cursor": "AB345dkfC",
"events": [
{ "id": "110", "created_timestamp": "1639919665615", ... },
{ "id": "109", "created_timestamp": "1639865141987", ... },
{ "id": "108", "created_timestamp": "1639827437833", ... },
{ "id": "107", "created_timestamp": "1639825389806", ... },
{ "id": "106", "created_timestamp": "1639825389796", ... },
{ "id": "105", "created_timestamp": "1639825389768", ... },
...
]
}
My question is: how do I convert created_timestamp to a valid date using Python?
You can experiment with timestamps on this resource.
In your case, you could use methods like these from the datetime standard library:
from datetime import date, datetime
# created_timestamp is in milliseconds; fromtimestamp() expects seconds
timestamp = int(timestamp_string) / 1000
datetime.fromtimestamp(timestamp, tz=None)
date.fromtimestamp(timestamp)
That said, once converted with int(), the values are already directly comparable if the task is only to tell the timestamps apart or order them.
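As a slightly fuller sketch of the same idea: the sample values have 13 digits, so they are milliseconds since the Unix epoch. The helper name to_datetime and the choice of UTC are my own assumptions; adding a timedelta to the epoch keeps the arithmetic in integers.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_datetime(created_timestamp: str) -> datetime:
    """Convert a millisecond epoch string to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=int(created_timestamp))


# Sample events taken from the question's example payload.
events = [
    {"id": "110", "created_timestamp": "1639919665615"},
    {"id": "109", "created_timestamp": "1639865141987"},
]

# Keep only events created after a given cut-off.
cutoff = to_datetime("1639865141987")
newer = [e for e in events if to_datetime(e["created_timestamp"]) > cutoff]
```

After this runs, newer contains only the event with id "110".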

Twitter API 2.0 - Unable to fetch user.fields

I am using API version 2.0 and unable to fetch the user.fields results. All other parameters seem to be returning results correctly. I'm following this documentation.
url = "https://api.twitter.com/2/tweets/search/all"
query_params = {
"query": "APPL",
"max_results": "10",
"tweet.fields": "created_at,lang,text,author_id",
"user.fields": "name,username,created_at,location",
"expansions": "referenced_tweets.id.author_id",
}
response = requests.request("GET", url, headers=headers, params=query_params).json()
Sample result:
{
'author_id': '1251347502013521925',
'text': 'All conspiracy. But watch for bad news on Apple. Such a vulnerable stocktechnically for the biggest market cap # $2.1T ( Thanks Jay). This is the glue for the bulls. But, they stopped innovating when Steve died, built a fancy office and split the stock. $appl',
'lang': 'en',
'created_at': '2021-06-05T02:33:48.000Z',
'id': '1401004298738311168',
'referenced_tweets': [{
'type': 'retweeted',
'id': '1401004298738311168'
}]
}
As you can see, the following information is not returned: name, username, and location.
Any idea how to retrieve this info?
Your query does actually return the correct data. I tested this myself.
A full example response will be structured like this:
{
"data": [
{
"created_at": "2021-06-05T02:33:48.000Z",
"lang": "en",
"id": "1401004298738311168",
"text": "All conspiracy. But watch for bad news on Apple. Such a vulnerable stocktechnically for the biggest market cap # $2.1T ( Thanks Jay). This is the glue for the bulls. But, they stopped innovating when Steve died, built a fancy office and split the stock. $appl",
"author_id": "1251347502013521925",
"referenced_tweets": [
{
"type": "retweeted",
"id": "1401004298738311168"
}
]
}
],
"includes": {
"users": [
{
"name": "Gary Casper",
"id": "1251347502013521925",
"username": "Hisel1979",
"created_at": "2020-07-11T13:39:58.000Z"
}
]
}
}
The sample result you provided comes from within the data object. However, the expanded object data will be nested in the includes object (in your case name, username, and location). The corresponding user object can be referenced via the author_id field.
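A minimal sketch of that lookup, assuming the response has the data / includes.users shape shown above. The function name attach_authors and the added "author" key are hypothetical.

```python
def attach_authors(response: dict) -> list:
    """Return the tweets from 'data', each with its user object under 'author'."""
    users = response.get("includes", {}).get("users", [])
    users_by_id = {user["id"]: user for user in users}
    merged = []
    for tweet in response.get("data", []):
        # author is None when no matching user was included in the response.
        author = users_by_id.get(tweet.get("author_id"))
        merged.append({**tweet, "author": author})
    return merged


# Trimmed copy of the example response above.
sample = {
    "data": [{"id": "1401004298738311168", "author_id": "1251347502013521925"}],
    "includes": {
        "users": [
            {"id": "1251347502013521925", "name": "Gary Casper", "username": "Hisel1979"}
        ]
    },
}
tweets = attach_authors(sample)
```

With the sample above, tweets[0]["author"]["username"] gives the username of the tweet's author.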

Artifactory and Jenkins - get file with newest/biggest custom property

I have a generic repository, "my_repo". I upload files there from Jenkins to paths like my_repo/branch_buildNumber/package.tar.gz, with a custom property "tag" such as "1.9.0", "1.10.0", etc. I want to get the item/file with the latest tag.
I tried to modify Example 2 from this link ...
https://www.jfrog.com/confluence/display/JFROG/Using+File+Specs#UsingFileSpecs-Examples
... and add sorting and limit the way it was done here ...
https://www.jfrog.com/confluence/display/JFROG/Artifactory+Query+Language#ArtifactoryQueryLanguage-limitDisplayLimitsandPagination
But I'm getting an "unknown property desc" error.
The Jenkins Artifactory Plugin, like most of the JFrog clients, supports File Specs for downloading and uploading generic files.
The File Specs schema is described here. When creating a File Spec for downloading files, you have the option of using the "pattern" property, which can include wildcards. For example, the following spec downloads all the zip files from the my-local-repo repository into the local froggy directory:
{
"files": [
{
"pattern": "my-local-repo/*.zip",
"target": "froggy/"
}
]
}
Alternatively, you can use "aql" instead of "pattern". The following spec provides the same result as the previous one:
{
"files": [
{
"aql": {
"items.find": {
"repo": "my-local-repo",
"$or": [
{
"$and": [
{
"path": {
"$match": "*"
},
"name": {
"$match": "*.zip"
}
}
]
}
]
}
},
"target": "froggy/"
}
]
}
The AQL syntax allowed inside File Specs does not include everything the Artifactory Query Language allows. For example, you can't use the "include" or "sort" clauses. These limitations were put in place to keep the response structure known and constant.
Sorting, however, is still available with File Specs, regardless of whether you choose to use "pattern" or "aql". It is supported through the "sortBy", "sortOrder", "limit" and "offset" File Spec properties.
For example, the following File Spec will download only the 3 largest zip files:
{
"files": [
{
"aql": {
"items.find": {
"repo": "my-local-repo",
"$or": [
{
"$and": [
{
"path": {
"$match": "*"
},
"name": {
"$match": "*.zip"
}
}
]
}
]
}
},
"sortBy": ["size"],
"sortOrder": "desc",
"limit": 3,
"target": "froggy/"
}
]
}
And you can do the same with "pattern", instead of "aql":
{
"files": [
{
"pattern": "my-local-repo/*.zip",
"sortBy": ["size"],
"sortOrder": "desc",
"limit": 3,
"target": "local/output/"
}
]
}
You can read more about File Specs here.
(After answering this question here, we also updated the File Specs documentation with these examples).
After a lot of testing and experimenting, I found that there are many ways of solving my main problem (getting the latest version of a package), but each of them requires some function that is only available in the paid version, like sort() in AQL or [RELEASE] in the REST API. However, I found that I can still get JSON with a full list of files and their properties, and I can also download each single file. This led me to a solution with a simple Python script. I can't publish the whole thing, only the core, which should be fairly obvious:
import argparse
import requests
from packaging import version
...
# AQL query: every file under the given repo/path, with all of its properties
query="""
items.find({
    "type" : "file",
    "$and":[{
        "repo" : {"$match" : \"""" + args.repository + """\"},
        "path" : {"$match" : \"""" + args.path + """\"}
    }]
}).include("name","repo","path","size","property.*")
"""
auth=(args.username,args.password)

def clearVersion(ver: str):
    # Keep only digits and dots so that version.parse() accepts the tag
    new = ''
    for letter in ver:
        if letter.isnumeric() or letter == ".":
            new+=letter
    return new

def latestArtifact(response: requests.Response):
    # Return the result item whose "tag" property holds the highest version
    response = response.json()
    latestVer = "0.0.0"
    currentItemIndex = 0
    chosenItemIndex = 0
    for results in response["results"]:
        for prop in results['properties']:
            if prop["key"] == "tag":
                if version.parse(clearVersion(prop["value"])) > version.parse(clearVersion(latestVer)):
                    latestVer = prop["value"]
                    chosenItemIndex = currentItemIndex
        currentItemIndex += 1
    return response["results"][chosenItemIndex]

req = requests.post(url,data=query,auth=auth)
if args.verbose:
    print(req.text)
latest = latestArtifact(req)
...
I just want to point out that THIS IS NOT a permanent solution. We just didn't want to buy a license yet because of one single problem. But if more problems like this come up, we will definitely buy a PRO subscription.
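For anyone who wants to try the selection step without the packaging dependency, here is a stdlib-only sketch. It is not the script above, just the same idea under my own assumptions: tags are dotted numbers (optionally with a prefix such as "v"), the items mimic the shape of the AQL results, and the paths are made up. It also shows why a plain string comparison is not enough: as strings, "1.9.0" sorts after "1.10.0".

```python
def version_key(tag: str) -> tuple:
    """Turn a tag like 'v1.10.0' into (1, 10, 0) so it compares numerically."""
    cleaned = "".join(ch for ch in tag if ch.isdigit() or ch == ".")
    return tuple(int(part) for part in cleaned.split(".") if part)


def latest_item(items: list) -> dict:
    """Return the item whose 'tag' property holds the highest version."""
    def tag_of(item: dict) -> tuple:
        for prop in item.get("properties", []):
            if prop.get("key") == "tag":
                return version_key(prop.get("value", ""))
        return ()  # items without a tag sort lowest

    return max(items, key=tag_of)


# Made-up items shaped like entries of the AQL "results" array.
items = [
    {"path": "branch_11", "properties": [{"key": "tag", "value": "1.9.0"}]},
    {"path": "branch_12", "properties": [{"key": "tag", "value": "1.10.0"}]},
    {"path": "branch_13", "properties": []},
]
newest = latest_item(items)
```

Note that max() raises ValueError on an empty list, so a real script would need to handle a query that returns no results.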

install plugin for Open Distro

Amazon Elasticsearch Service offers k-Nearest Neighbor (k-NN) search which can enhance search by similarity use cases.
https://aws.amazon.com/about-aws/whats-new/2020/03/build-k-nearest-neighbor-similarity-search-engine-with-amazon-elasticsearch-service/
I tried this official code that I found here...
https://github.com/opendistro-for-elasticsearch/k-NN
PUT /myindex
{
"settings" : {
"index": {
"knn": true
}
},
"mappings": {
"properties": {
"my_vector1": {
"type": "knn_vector",
"dimension": 2
},
"my_vector2": {
"type": "knn_vector",
"dimension": 4
},
"my_vector3": {
"type": "knn_vector",
"dimension": 8
}
}
}
}
Getting this error:
"unknown setting [index.knn] please check that any required plugins
are installed, or check the breaking changes documentation for removed
settings"
How do I check if my Elastic installation supports this feature?
The t2.small and t2.medium instance types are not supported. (This is not mentioned anywhere in the documentation.) It worked as expected once the r5.large instance type was selected.
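On the general question of checking what an installation supports: Elasticsearch exposes the installed plugins through the _cat/plugins endpoint. Below is a rough sketch using only the standard library. The function names are hypothetical, the check assumes the k-NN plugin's component name contains "knn", and a secured domain would additionally need authentication or signed requests, which this sketch does not handle.

```python
import json
import urllib.request


def fetch_plugins(endpoint: str) -> list:
    """Fetch the plugin list from <endpoint>/_cat/plugins as JSON."""
    url = endpoint.rstrip("/") + "/_cat/plugins?format=json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def has_knn_plugin(plugins: list) -> bool:
    """True if any listed component name contains 'knn' (an assumption)."""
    return any("knn" in p.get("component", "").lower() for p in plugins)


# Made-up sample in the shape returned by _cat/plugins?format=json.
sample_plugins = [
    {"name": "node-1", "component": "analysis-icu", "version": "7.4.2"},
    {"name": "node-1", "component": "opendistro-knn", "version": "1.4.0.0"},
]
knn_available = has_knn_plugin(sample_plugins)
```

In practice you would pass the result of fetch_plugins() with your own domain endpoint to has_knn_plugin().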
