Ingesting stream counters from Cloud Pubsub to Bigtable using Google Cloud Dataflow? - google-cloud-dataflow

I want to consume data from Pub/Sub, maintain counters for some data fields, and keep those counters up to date in real time. How can I do this in Google Cloud Dataflow? I can sink the data into Bigtable, but how do I update the data?
I am able to update Bigtable, but not the counters: the value just gets replaced with the new value instead of being incremented.
pubsub_data = (
    p
    | 'Read from pub sub' >> beam.io.ReadFromPubSub(subscription=input_subscription)
    | 'De-Serialize' >> beam.ParDo(PreProcess())
    | 'Count' >> beam.ParDo(Countit())
    | 'Conversion string to row object' >> beam.ParDo(CreateRowFn())
    | 'Write Data' >> WriteToBigTable(project_id=PROJECT, instance_id=INSTANCE, table_id=TABLE)
)

You should be able to do this with state and timers, outputting running totals and such from your Countit DoFn.
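For example, here is a minimal sketch of what a stateful Countit could look like, keeping a per-key running count with CombiningValueStateSpec. It assumes the De-Serialize step emits (key, value) tuples (stateful DoFns require keyed input); the state name and output shape are illustrative.
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import CombiningValueStateSpec

class Countit(beam.DoFn):
    COUNT_STATE = CombiningValueStateSpec('count', VarIntCoder(), sum)

    def process(self, element, count=beam.DoFn.StateParam(COUNT_STATE)):
        key, _value = element            # stateful DoFns require keyed input
        count.add(1)                     # bump the per-key running total
        yield key, count.read()          # emit (key, running_count) downstream
Downstream, CreateRowFn can then write the running total into the Bigtable cell, so each write carries the up-to-date count rather than the latest raw value.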

Related

How to find which pod is causing the most data ingestion in an AKS cluster using a KQL query?

I am trying to figure out which pod is producing the most billable data ingestion in AKS for Log Analytics.
I tried several queries and only found a query that checks a particular node.
Is there any query to check data ingestion for all pods, per namespace, to find out the billable data ingestion?
Thank you.
The default query shows logs per container, and not per pod as you would expect from a Kubernetes-specific logging system.
You can use the KQL query below in the Log Analytics Workspace -> View Designer -> click on the Logs button in the header -> Logging AKS Test -> Container Log.
let startTimestamp = ago(1h);
KubePodInventory
| where TimeGenerated > startTimestamp
| project ContainerID, PodName=Name
| distinct ContainerID, PodName
| join
(
ContainerLog
| where TimeGenerated > startTimestamp
)
on ContainerID
// at this point before the next pipe, columns from both tables are available to be "projected". Due to both
// tables having a "Name" column, we assign an alias as PodName to one column which we actually want
| project TimeGenerated, PodName, LogEntry, LogEntrySource
| order by TimeGenerated desc
For more information, please refer to this document.

Parsing attributes in Dataflow SQL

Given a Pub/Sub topic, BigQuery enables streaming data to a table using Dataflow SQL syntax.
Let's say you post this message {"a": 1, "b": 2, "c": 3} to a topic. In BigQuery, with the Dataflow engine, you would need to define the my_topic schema as
Step 1
event_timestamp: TIMESTAMP
a: INT64
b: INT64
c: INT64
And then create a Dataflow streaming job using the command below, so that it streams every message to a destination BigQuery table.
Step 2
gcloud dataflow sql query 'SELECT * FROM pubsub.topic.my_project.my_topic' \
--job-name my_job --region europe-west1 --bigquery-write-disposition write-append \
--bigquery-project my_project --bigquery-dataset staging --bigquery-table my_topic
gcloud pubsub topics publish my_topic --message='{"a": 1, "b": 2, "c": 3}'
bq query --nouse_legacy_sql \
'SELECT * FROM my_project.staging.my_topic ORDER BY event_timestamp DESC LIMIT 10'
+---------------------+-----+-----+-----+
|   event_timestamp   |  a  |  b  |  c  |
+---------------------+-----+-----+-----+
| 2020-10-28 14:21:40 |   1 |   2 |   3 |
+---------------------+-----+-----+-----+
At Step 2 I would also like to send --attribute="origin=gcloud,username=gcp" to the Pub/Sub topic. Is it possible to define the schema at Step 1 so that the attributes are written to the table automatically?
I have been trying different things:
attributes: STRUCT in the schema, following this Beam extensions documentation, but all I get is JSON parsing errors in Dataflow
gcloud pubsub topics publish my_topic --message='{"a": 1, "b": 2}' --attribute='c=3' expecting the message to be flattened as in this piece of code, but I get a NULL value for c in the resulting table.
Thank you.
Pub/Sub attributes are of MAP type, but that is not one of Dataflow SQL's supported types. There were discussions about adding support, but I don't know the status of that.
If attributes are important, I suggest creating a custom pipeline using ReadFromPubSub.
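For example, here is a minimal sketch of such a pipeline, reading messages together with their attributes and merging them into the row before writing to BigQuery. The subscription, table name, and dispositions below are assumptions, and the destination table is assumed to already have columns for the attributes.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def to_row(msg):
    # With with_attributes=True, ReadFromPubSub yields PubsubMessage objects.
    row = json.loads(msg.data.decode('utf-8'))
    row.update(msg.attributes)   # merge attributes (e.g. origin, username) into the row
    return row

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromPubSub(
           subscription='projects/my_project/subscriptions/my_subscription',
           with_attributes=True)
     | beam.Map(to_row)
     | beam.io.WriteToBigQuery(
           'my_project:staging.my_topic',
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))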

KSQL queries have strange keys in front of the key value

I have created a stream like this:
CREATE STREAM TEST1 WITH (KAFKA_TOPIC='TEST_1',VALUE_FORMAT='AVRO');
I then query the stream like this via CLI:
SELECT * FROM TEST1;
The results are looking like this:
1571225518167 | \u0000\u0000\u0000\u0000\u0001\u0006key | 7 | 7 | blue
I wonder why the key is formatted like this. Is my query somehow wrong? The value should be like this:
1571225518167 | key | 7 | 7 | blue
Your key is in Avro format, which KSQL doesn't support yet.
If you have control over the data producer, write the key in string format (e.g. with Kafka Connect, use org.apache.kafka.connect.storage.StringConverter). If not, and you need to use the key, e.g. for driving a KSQL table, you'd need to re-key the data using KSQL:
CREATE STREAM TEST1_REKEY AS SELECT * FROM TEST1 PARTITION BY my_key_col;

Join between Streaming data vs Historical Data in spark

Let's say I have transaction data and visit data:
visit
| userId | Visit source | Timestamp |
| A      | google ads   | 1         |
| A      | facebook ads | 2         |
transaction
| userId | total price | timestamp |
| A      | 100         | 248384    |
| B      | 200         | 43298739  |
I want to join the transaction data and the visit data to do sales attribution. I want to do it in real time, whenever a transaction occurs (streaming).
Is it scalable to join incoming data with very big historical data using the join function in Spark?
The historical data is the visits, since a visit can happen at any time (e.g. a visit one year before the transaction occurs).
I did a join of historical and streaming data in my project. The problem is that you have to cache the historical data in an RDD, and when streaming data comes in you can do the join operation. But this is a long process.
If you are updating the historical data, then you have to keep two copies and use an accumulator to work with one copy at a time, so updates won't affect the second copy.
For example:
transactionRDD is the stream RDD which you run at some interval.
visitRDD is the historical one, which you update once a day.
So you have to maintain two databases for visitRDD. When you are updating one database, transactionRDD can work with the cached copy of visitRDD, and when visitRDD has been updated, you switch to that copy. This is actually quite complicated.
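For comparison, here is a minimal sketch of a stream-static join in Spark Structured Streaming, where the static side is an ordinary batch DataFrame and no manual RDD caching or copy-switching is needed. The paths, Kafka topic, and transaction schema are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("sales-attribution").getOrCreate()

# Static (historical) side: the visit data, e.g. refreshed once a day.
visits = spark.read.parquet("s3://warehouse/visits/")

txn_schema = StructType([
    StructField("userId", StringType()),
    StructField("totalPrice", LongType()),
    StructField("timestamp", LongType()),
])

# Streaming side: transactions arriving on a Kafka topic as JSON.
transactions = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
    .select(from_json(col("value").cast("string"), txn_schema).alias("t"))
    .select("t.*"))

# Stream-static join: each micro-batch of transactions is joined
# against the static visits DataFrame.
attributed = transactions.join(visits, on="userId", how="left")

query = (attributed.writeStream
    .format("console")
    .outputMode("append")
    .start())
query.awaitTermination()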
I know this question is very old, but let me share my viewpoint. Today, this can easily be done in Apache Beam, and the job can run on the same Spark cluster.

Cassandra cql kind of multiget

I want to make a query for two column families at once... I'm using the cassandra-cql gem for Rails, and my column families are:
users
following
followers
user_count
message_count
messages
Now I want to get all messages from the people a user is following. Is there a kind of multiget in cassandra-cql, or is there another way, perhaps by changing the data model, to get this kind of data?
I would call your current data model a traditional entity/relational design. This would make sense to use with an SQL database. When you have a relational database you rely on joins to build your views that span multiple entities.
Cassandra does not have any ability to perform joins. So instead of modeling your data based on your entities and relations, you should model it based on how you intend to query it. For your example of 'all messages from the people a user is following' you might have a column family where the rowkey is the userid and the columns are all the messages from the people that user follows (where the column name is a timestamp+userid and the value is the message):
RowKey Columns
-------------------------------------------------------------------
| | TimeStamp0:UserA | TimeStamp1:UserB | TimeStamp2:UserA |
| UserID |------------------|------------------|------------------|
| | Message | Message | Message |
-------------------------------------------------------------------
You would probably also want a column family with all the messages a specific user has written (I'm assuming that the message is broadcast to all users instead of being addressed to one particular user):
RowKey Columns
--------------------------------------------------------
| | TimeStamp0 | TimeStamp1 | TimeStamp2 |
| UserID |------------|------------|-------------------|
| | Message | Message | Message |
--------------------------------------------------------
Now when you create a new message you will need to insert it in multiple places. But when you need to list all messages from the people a user is following, you only need to fetch from one row (which is fast).
Obviously, if you support updating or deleting messages, you will need to do that everywhere there is a copy of the message. You will also need to consider what should happen when a user follows or unfollows someone. There are multiple solutions to this problem, and your solution will depend on how you want your application to behave.
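For illustration, here is a minimal sketch of that fan-out write using the DataStax Python driver; the keyspace, table, and column names are hypothetical (the question uses the cassandra-cql Ruby gem, but the write pattern is the same).
import uuid
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('social')    # hypothetical keyspace

insert_own = session.prepare(
    "INSERT INTO messages_by_author (author_id, ts, body) VALUES (?, ?, ?)")
insert_timeline = session.prepare(
    "INSERT INTO timeline_by_user (user_id, ts, author_id, body) VALUES (?, ?, ?, ?)")

def post_message(author_id, follower_ids, body):
    ts = uuid.uuid1()                  # timeuuid keeps messages time-ordered
    batch = BatchStatement()
    batch.add(insert_own, (author_id, ts, body))
    for follower_id in follower_ids:   # one copy per follower's timeline row
        batch.add(insert_timeline, (follower_id, ts, author_id, body))
    session.execute(batch)
Reading "all messages from the people a user follows" is then a single-partition query on timeline_by_user for that user_id; for very large follower lists you would typically issue the per-follower inserts as separate asynchronous writes rather than one big batch.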
