Parsing attributes in Dataflow SQL - google-cloud-dataflow

Given a Pub/Sub topic, BigQuery lets you stream its data to a table using Dataflow SQL syntax.
Let's say you post this message {"a": 1, "b": 2, "c": 3} to a topic. In BigQuery, with the Dataflow engine, you would need to define the my_topic schema as
Step 1
event_timestamp: TIMESTAMP
a: INT64
b: INT64
c: INT64
Then you create a Dataflow streaming job with the command below, so that it streams every message to a destination BigQuery table.
Step 2
gcloud dataflow sql query 'SELECT * FROM pubsub.topic.my_project.my_topic' \
--job-name my_job --region europe-west1 --bigquery-write-disposition write-append \
--bigquery-project my_project --bigquery-dataset staging --bigquery-table my_topic
gcloud pubsub topics publish my_topic --message='{"a": 1, "b": 2, "c": 3}'
bq query --nouse_legacy_sql \
'SELECT * FROM my_project.staging.my_topic ORDER BY event_timestamp DESC LIMIT 10'
+---------------------+-----+-----+-----+
|   event_timestamp   |  a  |  b  |  c  |
+---------------------+-----+-----+-----+
| 2020-10-28 14:21:40 |   1 |   2 |   3 |
+---------------------+-----+-----+-----+
At Step 2 I would also like to send --attribute="origin=gcloud,username=gcp" to the Pub/Sub topic. Is it possible to define the schema at Step 1 so that the attributes are written to the table automatically?
I have been trying different things:
attributes: STRUCT in the schema, following this Beam extensions documentation, but all I get is JSON parsing errors in Dataflow
gcloud pubsub topics publish my_topic --message='{"a": 1, "b": 2}' --attribute='c=3' expecting the message to be flattened as in this piece of code, but I get a NULL value for c in the resulting table.
Thank you.

Pub/Sub attributes are of MAP type, but that is not one of Dataflow SQL's supported types. There have been discussions about adding support, but I don't know the current status.
If attributes are important, I suggest creating a custom pipeline using ReadFromPubSub, along the lines of the sketch below.
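For example, a minimal Python/Beam sketch (not the Dataflow SQL job above; the topic path, destination table and schema are assumptions) that reads messages together with their attributes and writes both to BigQuery:
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def merge_attributes(msg):
    # msg is a PubsubMessage because with_attributes=True is set below.
    row = json.loads(msg.data.decode('utf-8'))
    row.update(msg.attributes or {})  # e.g. adds origin and username columns
    return row


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromPubSub(
           topic='projects/my_project/topics/my_topic', with_attributes=True)
     | 'Merge attributes' >> beam.Map(merge_attributes)
     | 'Write' >> beam.io.WriteToBigQuery(
           'my_project:staging.my_topic_with_attributes',
           schema='a:INTEGER,b:INTEGER,c:INTEGER,origin:STRING,username:STRING',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
With this approach the schema is under your control, so attributes can be mapped to ordinary columns instead of a MAP type.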

Related

Ingesting stream counters from Cloud Pubsub to Bigtable using Google Cloud Dataflow?

I want to consume data from Pub/Sub, maintain counters for some data fields, and keep those counters updated in real time. How can I do this in Google Cloud Dataflow? I can sink the data into Bigtable, but how do I update the data?
I am able to update Bigtable, but not the counters: they get replaced with the new values.
pubsub_data = (
    p
    | 'Read from pub sub' >> beam.io.ReadFromPubSub(subscription=input_subscription)
    | 'De-Serialize' >> beam.ParDo(PreProcess())
    | 'Count' >> beam.ParDo(Countit())
    | 'Conversion string to row object' >> beam.ParDo(CreateRowFn())
    | 'Write Data' >> WriteToBigTable(project_id=PROJECT, instance_id=INSTANCE, table_id=TABLE)
)
You should be able to do this with state and timers, outputting running totals and such from your Countit DoFn.
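For example, a minimal sketch of a stateful DoFn that keeps a per-key running count (the class name and keying are assumptions, not the original Countit; the input must already be keyed as (key, value) pairs):
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import CombiningValueStateSpec


class RunningCounter(beam.DoFn):
    # Persistent per-key state that sums every value added to it.
    COUNT = CombiningValueStateSpec('count', VarIntCoder(), combine_fn=sum)

    def process(self, element, count=beam.DoFn.StateParam(COUNT)):
        key, value = element
        count.add(value)
        # Emit the key with its current running total; downstream, write
        # this total to Bigtable instead of the raw value so it is not
        # overwritten on each new message.
        yield key, count.read()
In the pipeline above, this would sit where the plain Countit ParDo is now, after keying the de-serialized elements (e.g. with a beam.Map that emits (some_field, 1) pairs; the field name is up to you).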

How to find which pod is taking more data ingestion in AKS cluster using kql query?

I am trying to figure out which pod produces the most billable data ingestion in AKS for Log Analytics.
I tried several queries, but I only found one that checks a particular node.
Is there any query that checks data ingestion for all pods, per namespace, to find the billable data ingestion?
Thank you.
The default query shows logs per container, and not per pod as you would expect from a Kubernetes-specific logging system.
You can use the KQL query below in Log Analytics Workspace -> View Designer -> Logs (button in the header) -> Logging AKS Test -> Container Log.
let startTimestamp = ago(1h);
KubePodInventory
| where TimeGenerated > startTimestamp
| project ContainerID, PodName=Name
| distinct ContainerID, PodName
| join
(
ContainerLog
| where TimeGenerated > startTimestamp
)
on ContainerID
// at this point before the next pipe, columns from both tables are available to be "projected". Due to both
// tables having a "Name" column, we assign an alias as PodName to one column which we actually want
| project TimeGenerated, PodName, LogEntry, LogEntrySource
| order by TimeGenerated desc
For more information, please refer to this document.

KSQL queries have strange keys in front of the key value

I have created a stream like this:
CREATE STREAM TEST1 WITH (KAFKA_TOPIC='TEST_1',VALUE_FORMAT='AVRO');
I then query the stream like this via CLI:
SELECT * FROM TEST1;
The results are looking like this:
1571225518167 | \u0000\u0000\u0000\u0000\u0001\u0006key | 7 | 7 | blue
I wonder why the key is formatted like this. Is my query somehow wrong? The value should look like this:
1571225518167 | key | 7 | 7 | blue
Your key is in Avro format, which KSQL doesn't support yet.
If you have control over the data producer, write the key in string format (e.g. in Kafka Connect, use org.apache.kafka.connect.storage.StringConverter). If not, and you need to use the key, e.g. for driving a KSQL table, you'd need to re-key the data using KSQL:
CREATE STREAM TEST1_REKEY AS SELECT * FROM TEST1 PARTITION BY my_key_col
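As an illustration of the first option, here is a minimal sketch of a producer that writes plain string keys, assuming the kafka-python library (the topic name and JSON value are made up; your producer and value format, e.g. Avro, may differ):
import json

from kafka import KafkaProducer

# Serialize the key as a plain string so consumers (including KSQL) see a
# readable key rather than Avro-encoded bytes.
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode('utf-8'))

producer.send('TEST_1', key='key', value={'a': 7, 'b': 7, 'color': 'blue'})
producer.flush()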

What are series and buckets in InfluxDB

While trying to understand different concepts of InfluxDB, I came across this documentation, where there is a comparison of terms with an SQL database.
An InfluxDB measurement is similar to an SQL database table.
InfluxDB tags are like indexed columns in an SQL database.
InfluxDB fields are like unindexed columns in an SQL database.
InfluxDB points are similar to SQL rows.
But there are a couple of other terms I came across which I could not clearly understand, and I wonder whether there is an SQL equivalent for them.
Series
Bucket
From what I understand from the documentation:
a series is the collection of data that share a retention policy, measurement, and tag set.
Does this mean a series is a subset of data in a database table? Or is it like a database view?
I could not see any documentation explaining buckets. I guess this is a new concept in the 2.0 release.
Can someone please clarify these two concepts?
I have summarized my understanding below:
A bucket is a named location with a retention policy where time-series data is stored.
A series is a logical grouping of data defined by shared measurement, tag and field.
A measurement is similar to an SQL database table.
A tag is similar to indexed columns in an SQL database.
A field is similar to unindexed columns in an SQL database.
A point is similar to an SQL row.
For example, a SQL table workdone:
+--------------------+--------+---------------------+-----------+
| Email              | Status | time                | Completed |
+--------------------+--------+---------------------+-----------+
| lorr@influxdb.com  | start  | 1636775801000000000 |        76 |
| lorr@influxdb.com  | finish | 1636775868000000000 |       120 |
| marv@influxdb.com  | start  | 1636775801000000000 |         0 |
| marv@influxdb.com  | finish | 1636775868000000000 |        20 |
| cliff@influxdb.com | start  | 1636775801000000000 |        54 |
| cliff@influxdb.com | finish | 1636775868000000000 |        56 |
+--------------------+--------+---------------------+-----------+
The columns Email and Status are indexed.
Hence:
Measurement: workdone
Tags: Email, Status
Field: Completed
Series (Cardinality = 3 x 2 = 6):
Measurement: workdone; Tags: Email: lorr@influxdb.com, Status: start; Field: Completed
Measurement: workdone; Tags: Email: lorr@influxdb.com, Status: finish; Field: Completed
Measurement: workdone; Tags: Email: marv@influxdb.com, Status: start; Field: Completed
Measurement: workdone; Tags: Email: marv@influxdb.com, Status: finish; Field: Completed
Measurement: workdone; Tags: Email: cliff@influxdb.com, Status: start; Field: Completed
Measurement: workdone; Tags: Email: cliff@influxdb.com, Status: finish; Field: Completed
Splitting a logical series across multiple buckets may not improve performance, but it may complicate Flux queries, since they would need to include multiple buckets.
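As an illustration, here is a minimal sketch of writing one of the rows above with the influxdb-client Python library (the URL, token, org and bucket name are placeholders):
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url='http://localhost:8086', token='my-token', org='my-org')
write_api = client.write_api(write_options=SYNCHRONOUS)

# Measurement "workdone", tags Email/Status, field "Completed": each distinct
# (Email, Status) combination becomes one series inside the bucket.
point = (Point('workdone')
         .tag('Email', 'lorr@influxdb.com')
         .tag('Status', 'start')
         .field('Completed', 76)
         .time(1636775801000000000))

write_api.write(bucket='my-bucket', record=point)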
The InfluxDB document that you link to has an example of what a series is, even if they don't label it as such. In InfluxDB, you can think of each combination of measurement and tags as being in its own "table". The documentation splits it like this.
This table in SQL:
+---------+---------+---------------------+--------------+
| park_id | planet  | time                | #_foodships  |
+---------+---------+---------------------+--------------+
| 1       | Earth   | 1429185600000000000 | 0            |
| 2       | Saturn  | 1429185601000000000 | 3            |
+---------+---------+---------------------+--------------+
Becomes these two Series in InfluxDb:
name: foodships
tags: park_id=1, planet=Earth
----
name: foodships
tags: park_id=2, planet=Saturn
...etc...
This has implications when you query for the data, and it is also the reason for the recommendation that you don't use tags with high-cardinality values. For example, if you had a tag of temperature (especially one precise to multiple decimal places), InfluxDB would create a "table" for each potential combination of tag values.
A bucket is much easier to understand. It's just the combination of a database with a retention policy. In previous versions of InfluxDB these were separate concepts, which have now been combined.
According to the InfluxDB glossary:
Bucket
A bucket is a named location where time-series data is stored in InfluxDB 2.0. In InfluxDB 1.8+, each combination of a database and a retention policy (database/retention-policy) represents a bucket. Use the InfluxDB 2.0 API compatibility endpoints included with InfluxDB 1.8+ to interact with buckets.
Series
A logical grouping of data defined by shared measurement, tag set, and field key.

Need to get data recursively in GraphQL

Is it possible to get data in one call for parent and child in GraphQL?
Sample data set
===================
| Id | Expression |
===================
| 1  | N_2 + N_3  |
| 2  | N_4 + N_5  |
| 3  | N_6 + N_7  |
===================
So basically, what I want in a single call is to:
query Id 1 and fetch its expression N_2 + N_3,
parse N_2 + N_3 to get 2 and 3, and
get the data for 2 and 3 from the table.
I want to get this in a single database hit.
I have read about nested queries, but I guess they hit the database again.
Kindly let me know if there is any way to do this, or whether graph databases like Neo4j are the only solution for such a problem.
I have achieved this in Neo4j, but for that I used SQLAlchemy to fetch the data and then built each record's children as a list of dicts like
{id:1, child: [1, 2, 4, 5]}
Is it possible to achieve this in Neo4j without going through SQLAlchemy?
