Nested rows using STRUCT are not supported in Dataflow SQL (GCP) - google-cloud-dataflow

With Dataflow SQL I would like to read a Pub/Sub topic, enrich the message and write the message to a Pub/Sub topic.
Which Dataflow SQL query will create my desired output message?
Pub/Sub input message: {"event_timestamp":1619784049000, "device":{"ID":"some_id"}}
Desired Pub/Sub output message: {"event_timestamp":1619784049000, "device":{"ID":"some_id", "NAME":"some_name"}}
What I get is: {"event_timestamp":1619784049000, "device":{"ID":"some_id"}, "NAME":"some_name"}
but I need the NAME inside the "device" attribute.
SELECT message_table.device as device, devices.name as NAME
FROM pubsub.topic.project_id.`topic` as message_table
JOIN bigquery.table.project_id.dataflow_sql_dataset.devices as devices
ON devices.device_id = message_table.device.id

Unfortunately, Dataflow SQL does not currently support STRUCT/Sub queries, but we are working on it. Since there are some Apache Beam dependencies preventing its progress (Nested Rows Support, Upgrading Calcite), we cannot provide an ETA at the moment, but you can follow its progress on this issue tracker.

Your query projects device and NAME as two separate top-level columns, which is why NAME ends up outside of device. You need to build the nested field yourself with a STRUCT in the projection (the SELECT part):
SELECT STRUCT(message_table.device.ID as ID, devices.name as NAME) as device
FROM pubsub.topic.project_id.`topic` as message_table
JOIN bigquery.table.project_id.dataflow_sql_dataset.devices as devices
ON devices.device_id = message_table.device.id
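Assuming nested row support eventually lands, this projection should nest NAME inside device as desired, e.g. "device":{"ID":"some_id", "NAME":"some_name"}, instead of emitting NAME as a separate top-level field.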

Related

ZetaSQL - Creating a simple catalog with tables and columns using local service

We are using a Python client binding for ZetaSQL GRPC local service in our application to analyze statements and extract referenced tables and output columns.
It is possible to extract referenced tables using the following simplified Python code and the local service:
import json

import zetasql.local_service as zql
from google.protobuf.json_format import MessageToJson

def extract_table_names(sql):
    conn = zql.connect()
    language_options = conn.GetLanguageOptions(
        zql.pb2.LanguageOptionsRequest(maximum_features=True)
    )
    # Used to allow the ZetaSQL parser to parse `CREATE TABLE AS` statements
    language_options.supported_statement_kinds.pop()
    req = zql.pb2.ExtractTableNamesFromStatementRequest(
        sql_statement=sql, options=language_options
    )
    res = conn.ExtractTableNamesFromStatement(req)
    return json.loads(MessageToJson(res))
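Assuming the ZetaSQL gRPC local service is running, calling the snippet above with a statement such as SELECT a FROM t1 JOIN t2 USING (id) should return a JSON response listing the referenced table names t1 and t2.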
However, from what I see here, the local service doesn't expose the full functionality of the Java client, mainly creating a simple catalog with tables and columns in order to analyze arbitrary SQL statements. Setting analyzer options also doesn't seem to be possible.
Is it possible to analyze SQL statements using ZetaSQL with only the local service? If not, what should be the alternative approach to extract output columns?

Spark Structured Streaming and Neo4j

My goal is to write transformed data from a MongoDB collection into Neo4j using Spark Structured Streaming. According to the Neo4j docs, this should be possible with the "Neo4j Connector for Apache Spark" version 4.1.2.
Batch queries work fine so far. However, with the example below, I run into an error:
spark-shell --packages org.mongodb.spark:mongo-spark-connector:10.0.2,org.neo4j:neo4j-connector-apache-spark_2.12:4.1.2_for_spark_3
val dfTxn = spark.readStream.format("mongodb")
.option("spark.mongodb.connection.uri", "mongodb://<IP>:<PORT>")
.option("spark.mongodb.database", "test")
.option("spark.mongodb.collection", "txn")
.option("park.mongodb.read.readPreference.name","primaryPreferred")
.option("spark.mongodb.change.stream.publish.full.document.only", "true")
.option("forceDeleteTempCheckpointLocation", "true").load()
val query = dfTxn.writeStream.format("org.neo4j.spark.DataSource")
.option("url", "bolt://<IP>:<PORT>")
.option("save.mode", "Append")
.option("checkpointLocation", "/tmp/checkpoint/myCheckPoint")
.option("labels", "Account")
.option("node.keys", "txn_snd").start()
This gives me the following error message:
java.lang.UnsupportedOperationException: Data source org.neo4j.spark.DataSource does not support streamed writing
This is despite the connector officially supporting streaming as of version 4.x. Does anybody have an idea what I'm doing wrong?
Thanks in advance!
In case the connector doesn't support streaming writes in your setup, you can try something like the following:
You can leverage the foreachBatch() functionality from Spark Structured Streaming and write the data into Neo4j in batch mode.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
def process_entry(batch_df, batch_id):
    # Write each micro-batch with the connector's batch writer (same options as a batch job)
    batch_df.write.format("org.neo4j.spark.DataSource").mode("Append") \
        .option("url", url).option("labels", "Account").option("node.keys", "txn_snd").save()

query = df.writeStream.foreachBatch(process_entry).start()
In the above code, you can put your Neo4j writer logic and write the data into the database in batch mode.
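Since batch writes to Neo4j already work in your setup, reusing the same batch write configuration (url, labels, node.keys) inside the foreachBatch callback should behave the same way for each micro-batch. Note that foreachBatch provides at-least-once guarantees, so the write should be idempotent (for example by merging on a key) to tolerate replayed micro-batches.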

Stream PubSub to Spanner - Wait.on Step

The requirement is to delete the data in the Spanner tables before inserting the data from the Pub/Sub messages. Since a MutationGroup does not guarantee the order of execution, I separated the delete mutations into their own set, so there are two sets: one for the Delete mutations and one for the AddReplace mutations.
PCollection<Data> dataJson =
    pipeLine
        .apply(PubsubIO.readStrings().fromSubscription(options.getInputSubscription()))
        .apply("ParsePubSubMessage", ParDo.of(new PubSubToDataFn()))
        .apply(Window.into(FixedWindows.of(Duration.standardSeconds(10))));

SpannerWriteResult deleteResult = dataJson
    .apply("DeleteDataMutation", MapElements.via(......))
    .apply("DeleteData", SpannerIO.write().withSpannerConfig(spannerConfig).grouped());

dataJson
    .apply("WaitOnDeleteMutation", Wait.on(deleteResult.getOutput()))
    .apply("AddReplaceMutation", MapElements.via(...))
    .apply("UpsertInfoToSpanner", SpannerIO.write().withSpannerConfig(spannerConfig).grouped());
This is a streaming Dataflow job; I tried multiple windowing strategies, but it never executes the "UpsertInfoToSpanner" step.
How can I fix this issue? Can someone suggest a path forward?
Update:
The requirement is to apply two mutation groups sequentially to the same input data, i.e. read the JSON from the Pub/Sub message, delete the existing data from multiple tables with one mutation group, and then insert the data read from the same JSON Pub/Sub message.
Re-posting the earlier comment for better visibility:
The Mutation operations within a single MutationGroup are guaranteed to be executed in order within a single transaction, so I don't see what the issue is here... The reason why Wait.on() never releases is because the output stream that is being waited on is on the global window, so will never be closed in a streaming pipeline.
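A minimal sketch of one possible direction, assuming the fixed ten-second windows from the read side are also acceptable for the signal: re-window the Spanner write result before handing it to Wait.on(), so the waited-on collection is no longer in the global window (untested; the step name is illustrative):
// Hypothetical re-windowing of the delete signal so Wait.on() can release per window
PCollection<Void> deleteSignal = deleteResult
    .getOutput()
    .apply("WindowDeleteSignal",
        Window.<Void>into(FixedWindows.of(Duration.standardSeconds(10))));

dataJson
    .apply("WaitOnDeleteMutation", Wait.on(deleteSignal))
    .apply("AddReplaceMutation", MapElements.via(...))
    .apply("UpsertInfoToSpanner", SpannerIO.write().withSpannerConfig(spannerConfig).grouped());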

Using existing pub sub subscription from google data flow

I am using Google Cloud Dataflow, and in one of the steps I am subscribing to a Pub/Sub topic using an already created subscription.
Here is the code snippet:
CustomPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(CustomPipelineOptions.class);
Pipeline p = Pipeline.create(options);
PCollection<TableRow> datastream = p.apply(PubsubIO.Read.named("Read device data from PubSub")
    .subscription("projects/<projectID>/subscriptions/<subscriptionname>")
    .topic(String.format("projects/%s/topics/%s", options.getSourceProject(), options.getSourceTopic()))
    .timestampLabel("ts")
    .withCoder(TableRowJsonCoder.of()));
The above code when executed results in the following error:
Error processing pipeline. Causes: (b5e276ef8c76419f): Unrecognized input pubsub_subscription for step s1.
I am passing the right subscription name and project ID, so I am not sure why I am still getting the above error.
Please kindly help.
Specifying only one of the two sources, either a topic or a subscription, should be enough.
I suggest you try:
PCollection<TableRow> datastream = p
.apply(PubsubIO.Read.named("Read device data from PubSub")
.topic(String.format("projects/%s/topics/%s", options.getSourceProject(), options.getSourceTopic()))
.timestampLabel("ts")
.withCoder(TableRowJsonCoder.of()));
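If you want to keep using the existing subscription instead, the mirror-image variant would drop the topic and pass only the subscription. Note that the getSourceSubscription() option below is hypothetical; you would need to add it to CustomPipelineOptions.
PCollection<TableRow> datastream = p
    .apply(PubsubIO.Read.named("Read device data from PubSub")
        // getSourceSubscription() is a hypothetical option on CustomPipelineOptions
        .subscription(String.format("projects/%s/subscriptions/%s", options.getSourceProject(), options.getSourceSubscription()))
        .timestampLabel("ts")
        .withCoder(TableRowJsonCoder.of()));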
Also: I suppose you are using the Dataflow 1.9 SDK? You might want to think about moving to the new Beam 2.0.0 release. You can find the reference for PubSub in that SDK here.

What is the correct syntax to use when trying to create a Data Source View to a linked server?

I have tried several statements, and this one at least returns data, but I get the error message: "Deferred prepare could not be prepared. Incorrect syntax near ')'. Incorrect syntax near the keyword 'DECLARE'." The following statement executed when creating the named query:
SELECT [vwStatistics].*
FROM
(
***THIS IS MY QUERY***
DECLARE @SQL1 VARCHAR(500)
SET @SQL1 = 'SELECT *
FROM OPENQUERY(PORTAL, ''SELECT DeviceID, Date, Count
FROM printer_stats.Statistics
GROUP BY DeviceID'')'
EXEC (@SQL1)
***END OF MY QUERY***
)
AS [vwStatistics] (Microsoft.AnalysisServices.Controls)
I am new to linked servers and to SSAS; this is our company's first cube built from a linked server. My query does run in Management Studio and feeds an SSRS report, but it is slow.
Any suggestions would be helpful. There is not much information on the syntax for this situation on the web. What I have found mostly suggests changing server settings (e.g. making sure OPENROWSET is enabled or reinstalling the OWC component), which I do not have the ability to do.
This is what we found to work:
SELECT DeviceID, CAST(statsdt AS CHAR) AS sdt, Count
FROM OPENQUERY(PORTAL, 'select * from (select DeviceID, CAST(Date AS CHAR) statsdt, Count from printer_stats.Statistics) as pstats')
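The earlier attempt failed because a Data Source View named query is wrapped in a derived table (SELECT [vwStatistics].* FROM ( ... ) AS [vwStatistics]), and a derived table must be a single SELECT statement; DECLARE and EXEC are not allowed there, hence the "Incorrect syntax near the keyword 'DECLARE'" error. Moving everything into a single OPENQUERY pass-through, as above, keeps the named query a plain SELECT.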
