With Data Fusion, transform a BigQuery source table containing ARRAY/STRUCT via the "Wrangler" transform into a corresponding normalized BigQuery table

(1) There is a BigQuery source-table like ...
column_name | is_nullable | data_type
OrderId | YES | STRING
items | NO | ARRAY<STRUCT<articleId STRING, quantity FLOAT64>>
"OrderId" is supposed to be the Key from ralational table perspective.
(2) Now I'd like to normalize the ARRAY/STRUCT record to a separate table.
To achieve this I'm using the Transform "Wrangler".
NOTE: It's the "Wrangler" from the Transform section of Data Fusion's Studio! When trying to open the Wrangler via the hamburger menu and selecting the BQ source table, it reports: BigQuery type STRUCT is not supported.
The output of the source-table is linked to the input of Wrangler.
In Wrangler I defined ...
Input field name: *
Precondition: false
Directives / Recipe: keep combiOrderId,items,articleId,quantity
Output Schema (Name | Type | Null): -- (according to the source table, JSON attached below)
combiOrderId | string | yes
items | array | no
record [ {articleId | string | yes}, {quantity | float | yes} ]
Wrangler parameters screen
(3) The BQ sink table takes the Wrangler Output as Input Schema, and I defined the final schema as (Name | Type | Null)
combiOrderId | string | yes
articleId | string | yes
quantity | float | yes
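For comparison, a hedged DDL sketch of the normalized sink (again with hypothetical names), holding one row per order item:
-- Hypothetical normalized sink table (one row per item of an order).
CREATE TABLE `project.dataset.order_items` (
  combiOrderId STRING,
  articleId STRING,
  quantity FLOAT64
);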
Now, when running the pipeline (Preview mode), the following error message is logged:
Problem converting into output record. Reason : Unable to decode array
'items'
(full message further below)
Any hint or an alternative solution would be very welcome :-)
Thank you.
JSON of Wrangler's Output Schema:
[
  {
    "name": "etlSchemaBody",
    "schema": {
      "type": "record",
      "name": "etlSchemaBody",
      "fields": [
        {
          "name": "combiOrderId",
          "type": [
            "string",
            "null"
          ]
        },
        {
          "name": "items",
          "type": {
            "type": "array",
            "items": {
              "type": "record",
              "name": "a6adafef5943d4757b2fad43a10732952",
              "fields": [
                {
                  "name": "articleId",
                  "type": [
                    "string",
                    "null"
                  ]
                },
                {
                  "name": "quantity",
                  "type": [
                    "float",
                    "null"
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
]
Full (first) error log:
java.lang.Exception: Stage:Normalize-items - Reached error threshold 1, terminating processing due to error : Problem converting into output record. Reason : Unable to decode array 'items'
at io.cdap.wrangler.Wrangler.transform(Wrangler.java:412) ~[1576661389534-0/:na]
at io.cdap.wrangler.Wrangler.transform(Wrangler.java:94) ~[1576661389534-0/:na]
at io.cdap.cdap.etl.common.plugin.WrappedTransform.lambda$transform$5(WrappedTransform.java:90) ~[cdap-etl-core-6.1.0.jar:na]
at io.cdap.cdap.etl.common.plugin.Caller$1.call(Caller.java:30) ~[cdap-etl-core-6.1.0.jar:na]
at io.cdap.cdap.etl.common.plugin.WrappedTransform.transform(WrappedTransform.java:89) ~[cdap-etl-core-6.1.0.jar:na]
at io.cdap.cdap.etl.common.TrackedTransform.transform(TrackedTransform.java:74) ~[cdap-etl-core-6.1.0.jar:na]
at io.cdap.cdap.etl.spark.function.TransformFunction.call(TransformFunction.java:50) ~[hydrator-spark-core2_2.11-6.1.0.jar:na]
at io.cdap.cdap.etl.spark.Compat$FlatMapAdapter.call(Compat.java:126) ~[hydrator-spark-core2_2.11-6.1.0.jar:na]
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:128) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.Task.run(Task.scala:109) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) [spark-core_2.11-2.3.3.jar:2.3.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_232]
Caused by: io.cdap.wrangler.api.RecipeException: Problem converting into output record. Reason : Unable to decode array 'items'
at io.cdap.wrangler.executor.RecipePipelineExecutor.execute(RecipePipelineExecutor.java:102) ~[wrangler-core-4.1.3.jar:na]
at io.cdap.wrangler.Wrangler.transform(Wrangler.java:384) ~[1576661389534-0/:na]
... 25 common frames omitted
Caused by: io.cdap.wrangler.utils.RecordConvertorException: Unable to decode array 'items'
at io.cdap.wrangler.utils.RecordConvertor.decodeArray(RecordConvertor.java:382) ~[wrangler-core-4.1.3.jar:na]
at io.cdap.wrangler.utils.RecordConvertor.decode(RecordConvertor.java:142) ~[wrangler-core-4.1.3.jar:na]
at io.cdap.wrangler.utils.RecordConvertor.decodeUnion(RecordConvertor.java:368) ~[wrangler-core-4.1.3.jar:na]
at io.cdap.wrangler.utils.RecordConvertor.decode(RecordConvertor.java:152) ~[wrangler-core-4.1.3.jar:na]
at io.cdap.wrangler.utils.RecordConvertor.decodeRecord(RecordConvertor.java:85) ~[wrangler-core-4.1.3.jar:na]
at io.cdap.wrangler.utils.RecordConvertor.toStructureRecord(RecordConvertor.java:56) ~[wrangler-core-4.1.3.jar:na]
at io.cdap.wrangler.executor.RecipePipelineExecutor.execute(RecipePipelineExecutor.java:99) ~[wrangler-core-4.1.3.jar:na]
... 26 common frames omitted

Adding the comment as an answer.
Regarding debugging the error:
The easiest way to check what the error could be is to navigate from Wrangler. You can do this by following these steps:
Go to the Wrangler connections page: /cdap/ns/default/connections
Click on the BQ source (or create a BigQuery connection)
Navigate to the BQ table and click on it.
This will take you to the Wrangler workspace (tabbed view).
From there you can apply all the transformations and click "Create Pipeline".
After this point you should see your source and Wrangler transform already configured. You can then add a sink and run a preview to test if things work.
To address your other point: Wrangler only supports the array type in a BQ source; it doesn't support reading STRUCT types from BigQuery. My guess is that's why you are seeing this issue: issues.cask.co/browse/CDAP-15665
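As a possible alternative, the flattening could also be done in BigQuery itself before Data Fusion reads the data, e.g. with a Standard SQL query (or a view) over the source table; a minimal sketch, assuming the hypothetical `project.dataset.orders` table from above:
-- Flatten ARRAY<STRUCT<...>> into one row per item; column names follow the sink schema.
SELECT
  o.OrderId AS combiOrderId,
  item.articleId,
  item.quantity
FROM `project.dataset.orders` AS o,
     UNNEST(o.items) AS item;
The result of such a query already matches the normalized sink schema, so no STRUCT ever reaches Wrangler.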

Related

Error while reading data, error message: JSON table encountered too many errors, giving up. Rows

I have two files and am doing an inner join using CoGroupByKey in Apache Beam.
When I am writing rows to BigQuery, it gives me the following error:
RuntimeError: BigQuery job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_614_c4a563c648634e9dbbf7be3a56578b6d_2f196decc8984a0d83dee92e19054ffb failed. Error Result: <ErrorProto
location: 'gs://dataflow4bigquery/temp/bq_load/06bfafaa9dbb47338ad4f3a9914279fe/dotted-transit-351803.test_dataflow.inner_join/f714c1ac-c234-4a37-bf51-c725a969347a'
message: 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.'
reason: 'invalid'> [while running 'WriteToBigQuery/BigQueryBatchFileLoads/WaitForDestinationLoadJobs']
-----------------code-----------------------
from apache_beam.io.gcp.internal.clients import bigquery
import apache_beam as beam

def retTuple(element):
    thisTuple = element.split(',')
    return (thisTuple[0], thisTuple[1:])

def jstr(cstr):
    import datetime
    left_dict = cstr[1]['dep_data']
    right_dict = cstr[1]['loc_data']
    for i in left_dict:
        for j in right_dict:
            id, name, rank, dept, dob, loc, city = ([cstr[0]] + i + j)
            json_str = {"id": id, "name": name, "rank": rank, "dept": dept,
                        "dob": datetime.datetime.strptime(dob, "%d-%m-%Y").strftime("%Y-%m-%d").strip("'"),
                        "loc": loc, "city": city}
    return json_str

table_spec = 'dotted-transit-351803:test_dataflow.inner_join'
table_schema = 'id:INTEGER,name:STRING,rank:INTEGER,dept:STRING,dob:STRING,loc:INTEGER,city:STRING'
gcs = 'gs://dataflow4bigquery/temp/'

p1 = beam.Pipeline()

# Read each input file and key every record by its first CSV field.
dep_rows = (
    p1
    | "Reading File 1" >> beam.io.ReadFromText('dept_data.txt')
    | 'Pair each employee with key' >> beam.Map(retTuple)  # {149633CM : [Marco,10,Accounts,1-01-2019]}
)
loc_rows = (
    p1
    | "Reading File 2" >> beam.io.ReadFromText('location.txt')
    | 'Pair each loc with key' >> beam.Map(retTuple)  # {149633CM : [9876843261,New York]}
)

results = (
    {'dep_data': dep_rows, 'loc_data': loc_rows}
    | beam.CoGroupByKey()
    | beam.Map(jstr)
    | beam.io.WriteToBigQuery(
        custom_gcs_temp_location=gcs,
        table=table_spec,
        schema=table_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        additional_bq_parameters={'timePartitioning': {'type': 'DAY'}}
    )
)

p1.run().wait_until_finish()
I am running it on GCP using the Dataflow runner.
When printing the json_str string, the output is valid JSON.
Eg:
{'id': '149633CM', 'name': 'Marco', 'rank': '10', 'dept': 'Accounts', 'dob': '2019-01-31', 'loc': '9204232778', 'city': 'New York'}
{'id': '212539MU', 'name': 'Rebekah', 'rank': '10', 'dept': 'Accounts', 'dob': '2019-01-31', 'loc': '9995440673', 'city': 'Denver'}
The schema I have defined is also correct.
But I am getting that error when loading it into BigQuery.
After doing some research, I finally solved it.
It was a schema error.
The Id column value is like 149633CM.
I had given the data type of Id as INTEGER, but when I tried to load the JSON with bq and --autodetect as the schema, bq marked the data type of Id as STRING.
After that, I changed the data type of the Id column to STRING in the schema in my code.
And it worked. The table was created and got loaded.
But there is one thing I don't get: if the first 6 characters of the Id column are numbers, why does INTEGER not work while STRING does?
Because the data type is parsed on the whole field value and not only the first 6 characters. If you drop the last 2 characters, you can use INTEGER.
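To illustrate that point with a hedged BigQuery Standard SQL example (not part of the original answer): the cast only succeeds when the entire value is numeric.
-- SAFE_CAST returns NULL instead of failing; a plain CAST of '149633CM' would raise an error.
SELECT SAFE_CAST('149633CM' AS INT64) AS full_value,   -- NULL: the trailing 'CM' prevents the cast
       SAFE_CAST('149633'   AS INT64) AS prefix_only;  -- 149633: the numeric prefix alone casts fine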

Apache Beam Python - SQL Transform with named PCollection Issue

I am trying to execute the code below, in which I am using a NamedTuple for the PCollection and SqlTransform for doing a simple select.
As per the video link (4:06): https://www.youtube.com/watch?v=zx4p-UNSmrA.
Instead of using PCOLLECTION in the SqlTransform query, named PCollections can also be provided, as below.
Code Block
import typing

import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

class EmployeeType(typing.NamedTuple):
    name: str
    age: int

beam.coders.registry.register_coder(EmployeeType, beam.coders.RowCoder)

p = beam.Pipeline()  # pipeline object assumed by the original snippet

pcol = p | "Create" >> beam.Create([EmployeeType(name="ABC", age=10)]).with_output_types(EmployeeType)

(
    {'a': pcol}
    | SqlTransform(""" SELECT age FROM a """)
    | "Map" >> beam.Map(lambda row: row.age)
    | "Print" >> beam.Map(print)
)

p.run()
However, this code block errors out with the error:
Caused by: org.apache.beam.vendor.calcite.v1_28_0.org.apache.calcite.sql.validate.SqlValidatorException: Object 'a' not found
The Apache Beam SDK used is 2.35.0. Are there any known limitations in using named PCollections?

Gremlin, combine two queries and join data

I have a problem making a query for the following case:
+--------------------hasManager-------------------+
| | |
| property:isPersonalMngr=true (bool) |
| v
[ Employee ]-- hasShift -->[ Shift ]-- hasManager -->[ Manager ]
| | |
| | property:isPersonalMngr=false (bool)
| |
| property:name (text)
|
property:baseShift (bool)
For a manager 'John', who is managing shifts and can also be a personal manager of an employee, I want to return all the employees he's managing, with the list of shifts for each employee. Each employee has a 'baseShift' (say 'night' / 'day') and a scheduled shift ('wed123').
Eg:
[ 'Employee1', [ 'night', 'wed123', 'sat123' ]]
[ 'Employee2', [ 'day', 'mon123', 'tue123' ]]
For the shift employees I have this:
g.V('John').in('hasManager').in('hasShift').hasLabel('Employee')
For the personally managed employees I have this:
g.V('John').in('hasManager').hasLabel('Employee')
How do I combine these two AND add the name property of the shift in a list?
Thanks.
To test this, I created the following graph. Hope this fits your data model from above:
g.addV('Manager').property(id,'John').as('john').
addV('Manager').property(id,'Terry').as('terry').
addV('Manager').property(id,'Sally').as('sally').
addV('Employee').property(id,'Tom').as('tom').
addV('Employee').property(id,'Tim').as('tim').
addV('Employee').property(id,'Lisa').as('lisa').
addV('Employee').property(id,'Sue').as('sue').
addV('Employee').property(id,'Chris').as('chris').
addV('Employee').property(id,'Bob').as('bob').
addV('Shift').property('name','mon123').as('mon123').
addV('Shift').property('name','tues123').as('tues123').
addV('Shift').property('name','sat123').as('sat123').
addV('Shift').property('name','wed123').as('wed123').
addE('hasManager').from('tom').to('john').property('isPersonalMngr',true).
addE('hasManager').from('tim').to('john').property('isPersonalMngr',true).
addE('hasManager').from('lisa').to('terry').property('isPersonalMngr',true).
addE('hasManager').from('sue').to('terry').property('isPersonalMngr',true).
addE('hasManager').from('chris').to('sally').property('isPersonalMngr',true).
addE('hasManager').from('bob').to('sally').property('isPersonalMngr',true).
addE('hasShift').from('tom').to('mon123').property('baseShift','day').
addE('hasShift').from('tim').to('tues123').property('baseShift','night').
addE('hasShift').from('lisa').to('wed123').property('baseShift','night').
addE('hasShift').from('sue').to('sat123').property('baseShift','night').
addE('hasShift').from('chris').to('wed123').property('baseShift','day').
addE('hasShift').from('bob').to('sat123').property('baseShift','day').
addE('hasShift').from('bob').to('mon123').property('baseShift','day').
addE('hasShift').from('tim').to('wed123').property('baseShift','day').
addE('hasManager').from('mon123').to('terry').property('isPersonalMngr',false).
addE('hasManager').from('tues123').to('sally').property('isPersonalMngr',false).
addE('hasManager').from('wed123').to('john').property('isPersonalMngr',false).
addE('hasManager').from('sat123').to('terry').property('isPersonalMngr',false)
From this, the following query generates output in the format that you're looking for:
gremlin> g.V('John').
union(
inE('hasManager').has('isPersonalMngr',true).outV(),
inE('hasManager').has('isPersonalMngr',false).outV().in('hasShift')).
dedup().
map(union(id(),out('hasShift').values('name').fold()).fold())
==>[Tom,[mon123]]
==>[Tim,[tues123,wed123]]
==>[Lisa,[wed123]]
==>[Chris,[wed123]]
A note on your data model - you could likely simplify things by having two different types of edges for hasManager and that would remove the need for a boolean property on those edges. Instead, you could have hasOrgManager and hasShiftManager edges and that would remove the need for the property checks when traversing those edges.

Is there a solution in RML for multiple complex entities in one data element (cell) of the input without cleaning the input data?

I have a list of person names like, for example, this excerpt (Person is the column name):
Person
"Wilson, Charles; Harris Arthur"
"White, D.
Arthur Harris"
Note that the multiple persons are mentioned in different ways and are separated differently.
I would like to use the RDF Mapping Language https://rml.io/ to create the following RDF without cleaning (or changing) the input data:
:Wilson a foaf:Person;
foaf:firstName "Charles";
foaf:lastName "Wilson" .
:Harris a foaf:Person;
foaf:firstName "Arthur";
foaf:lastName "Harris" .
:White a foaf:Person;
foaf:firstName "D.";
foaf:lastName "White" .
Note that Arthur Harris is mentioned twice in the input data, but only a single RDF resource is created.
I use the Function Ontology https://fno.io/ and created a custom Java method. Based on the argument mode, a list of person properties is returned (e.g. only the URIs or only the first names).
public static List<String> getPersons(String value, String mode) {
    if (mode == null || value.trim().isEmpty())
        return Arrays.asList();
    List<String> results = new ArrayList<>();
    for (Person p : getAllPersons(value)) {
        if (mode.trim().isEmpty() || mode.equals("URI")) {
            results.add("http://example.org/person/" + p.getLastName());
        } else if (mode.equals("firstName")) {
            results.add(p.getFirstName());
        } else if (mode.equals("lastName")) {
            results.add(p.getLastName());
        } else if (mode.equals("fullName")) {
            results.add(p.getFullName());
        }
    }
    return results;
}
Assume that the getAllPersons method correctly extracts the persons from a given string, like the ones above.
In order to extract multiple persons from one cell I call the getPersons function in a subjectMap like this:
:tripleMap a rr:TriplesMap .
:tripleMap rml:logicalSource :ExampleSource .
:tripleMap rr:subjectMap [
    fnml:functionValue [
        rr:predicateObjectMap [
            rr:predicate fno:executes ;
            rr:objectMap [ rr:constant cf:getPersons ]
        ] ;
        rr:predicateObjectMap [
            rr:predicate grel:valueParameter ;
            rr:objectMap [ rml:reference "Person" ] # the column name
        ] ;
        rr:predicateObjectMap [
            rr:predicate grel:valueParameter2 ;
            rr:objectMap [ rr:constant "URI" ] # the mode
        ]
    ];
    rr:termType rr:IRI ;
    rr:class foaf:Person
] .
I use RMLMapper https://github.com/RMLio/rmlmapper-java; however, it only allows returning one subject for each line, see https://github.com/RMLio/rmlmapper-java/blob/master/src/main/java/be/ugent/rml/Executor.java#L292 .
That is why I wrote a List<ProvenancedTerm> getSubjects(Term triplesMap, Mapping mapping, Record record, int i) method and replaced it accordingly.
This leads to the following result:
:Wilson a foaf:Person .
:Harris a foaf:Person .
:White a foaf:Person .
I know that this extension is incompatible with the RML specification https://rml.io/specs/rml/ where the following is stated:
It [a triples map] must have exactly one subject map that specifies how to generate a subject for each row/record/element/object of the logical source (database/CSV/XML/JSON data source accordingly).
If I proceed to also add the first name and the last name, the following predicateObjectMap could be added:
:tripleMap rr:predicateObjectMap [
    rr:predicate foaf:firstName;
    rr:objectMap [
        fnml:functionValue [
            rr:predicateObjectMap [
                rr:predicate fno:executes ;
                rr:objectMap [ rr:constant cf:getPersons ]
            ] ;
            rr:predicateObjectMap [
                rr:predicate grel:valueParameter ;
                rr:objectMap [ rml:reference "Person" ] # the column name
            ] ;
            rr:predicateObjectMap [
                rr:predicate grel:valueParameter2 ;
                rr:objectMap [ rr:constant "firstName" ] # the mode
            ]
        ]
    ]
] .
Because a predicateObjectMap is evaluated for each subject and multiple subjects are returned now, every person resource will get the first name of every person. In order to make it more clear, it looks like this:
:Wilson a foaf:Person;
foaf:firstName "Charles" ;
foaf:firstName "Arthur" ;
foaf:firstName "D." .
:Harris a foaf:Person;
foaf:firstName "Charles" ;
foaf:firstName "Arthur" ;
foaf:firstName "D." .
:White a foaf:Person;
foaf:firstName "Charles" ;
foaf:firstName "Arthur" ;
foaf:firstName "D." .
My question is: Is there a solution or work-around in RML for multiple complex entities (e.g. persons having first and last names) in one data element (cell) of the input without cleaning (or changing) the input data?
Maybe this issue is related to my question: https://www.w3.org/community/kg-construct/track/issues/3
It would also be fine if such a use case is not meant to be solved by a mapping framework like RML. If this is the case, what could be alternatives? For example, a handcrafted extraction pipeline that generates RDF?
As far as I am aware, what you are trying to do is not possible using FnO functions and join conditions.
However, what you could try is specifying a clever rml:query or rml:iterator which splits the complex values before they reach the RMLMapper. Whether this is possible depends on the specific source database, though.
For instance, if the source is a SQL Server database, you could use the function STRING_SPLIT. Or if it is a PostgreSQL database, you could use STRING_TO_ARRAY together with unnest. (Since different separators are used in the data, it is possible you would have to call STRING_SPLIT or STRING_TO_ARRAY once for each different separator.)
If you provide more information about the underlying database, I can update this answer with an example.
(Note: I contribute to RML and its technologies.)
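For illustration only (not from the original answers), a minimal PostgreSQL sketch of such an rml:query, assuming a hypothetical table persons_raw with the Person column and ';' as one of the separators:
-- Split multi-person cells into one row per person before the RML mapping runs.
SELECT trim(person) AS "Person"
FROM persons_raw,
     unnest(string_to_array(persons_raw."Person", ';')) AS person;
-- A SQL Server equivalent would use CROSS APPLY STRING_SPLIT(Person, ';').
A remaining caveat, as noted above, is that the excerpt also uses other separators, so several split passes (or a more elaborate expression) would still be needed.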
As I understand it, you have a normalization problem (multi-value cells). Definitely, what you are asking for is to have a dataset in 1NF, see: https://en.wikipedia.org/wiki/First_normal_form
To address these usual heterogeneity problems in CSV files you can use CSV on the Web annotations (W3C recommendation). More in detail, the property you are asking for in this case is csvw:separator (https://www.w3.org/TR/tabular-data-primer/#sequence-values).
However, there are not many parsers for CSVW, and the semantics of its properties for generating RDF are not very clear. We've been working on a solution that works with CSVW and RML+FnO to generate virtual KGs from tabular data (having also a SPARQL query as input and not transforming the input dataset to RDF). The output of our proposal is a well-formed database with a standard [R2]RML mapping, so any [R2]RML-compliant engine could be used to answer queries or to materialize the knowledge graph. Although we currently do not support the materialization step, it is on our to-do list.
You can take a look at the contribution (under review right now): http://www.semantic-web-journal.net/content/enhancing-virtual-ontology-based-access-over-tabular-data-morph-csv
Website: https://morph.oeg.fi.upm.es/tool/morph-csv

Save last record according to a composite key. ksqlDB 0.6.0

I have a Kafka topic with the following data flow ( ksqldb_topic_01 ):
% Reached end of topic ksqldb_topic_01 [0] at offset 213
{"city":"Sevilla","temperature":20,"sensorId":"sensor03"}
% Reached end of topic ksqldb_topic_01 [0] at offset 214
{"city":"Madrid","temperature":5,"sensorId":"sensor03"}
% Reached end of topic ksqldb_topic_01 [0] at offset 215
{"city":"Sevilla","temperature":10,"sensorId":"sensor01"}
% Reached end of topic ksqldb_topic_01 [0] at offset 216
{"city":"Valencia","temperature":15,"sensorId":"sensor03"}
% Reached end of topic ksqldb_topic_01 [0] at offset 217
{"city":"Sevilla","temperature":15,"sensorId":"sensor01"}
% Reached end of topic ksqldb_topic_01 [0] at offset 218
{"city":"Madrid","temperature":20,"sensorId":"sensor03"}
% Reached end of topic ksqldb_topic_01 [0] at offset 219
{"city":"Valencia","temperature":15,"sensorId":"sensor02"}
% Reached end of topic ksqldb_topic_01 [0] at offset 220
{"city":"Sevilla","temperature":5,"sensorId":"sensor02"}
% Reached end of topic ksqldb_topic_01 [0] at offset 221
{"city":"Sevilla","temperature":5,"sensorId":"sensor01"}
% Reached end of topic ksqldb_topic_01 [0] at offset 222
And I want to save in a table the last value that arrives in the topic, for each city and sensorId.
In ksqlDB I create the following table:
CREATE TABLE ultimo_resgistro(city VARCHAR,sensorId VARCHAR,temperature INTEGER) WITH (KAFKA_TOPIC='ksqldb_topic_01', VALUE_FORMAT='json',KEY = 'sensorId,city');
DESCRIBE EXTENDED ULTIMO_RESGISTRO;
Name : ULTIMO_RESGISTRO
Type : TABLE
Key field : SENSORID
Key format : STRING
Timestamp field : Not set - using <ROWTIME>
Value format : JSON
Kafka topic : ksqldb_topic_01 (partitions: 1, replication: 1)
Field | Type
-----------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
CITY | VARCHAR(STRING)
SENSORID | VARCHAR(STRING)
TEMPERATURE | INTEGER
-----------------------------------------
Checking whether data is being processed:
select * from ultimo_resgistro emit changes;
+------------------+------------------+------------------+------------------+------------------+
|ROWTIME |ROWKEY |CITY |SENSORID |TEMPERATURE |
+------------------+------------------+------------------+------------------+------------------+
key cannot be null
Query terminated
The problem is that you need to set the key of the Kafka message correctly. You also cannot specify two fields in the KEY clause. Read more about this here
Here's an example of how to do it.
First up, load test data:
kafkacat -b kafka-1:39092 -P -t ksqldb_topic_01 <<EOF
{"city":"Madrid","temperature":20,"sensorId":"sensor03"}
{"city":"Madrid","temperature":5,"sensorId":"sensor03"}
{"city":"Sevilla","temperature":10,"sensorId":"sensor01"}
{"city":"Sevilla","temperature":15,"sensorId":"sensor01"}
{"city":"Sevilla","temperature":20,"sensorId":"sensor03"}
{"city":"Sevilla","temperature":5,"sensorId":"sensor01"}
{"city":"Sevilla","temperature":5,"sensorId":"sensor02"}
{"city":"Valencia","temperature":15,"sensorId":"sensor02"}
{"city":"Valencia","temperature":15,"sensorId":"sensor03"}
EOF
Now in ksqlDB declare the schema over the topic - as a stream, because we need to repartition the data to add a key. If you control the producer to the topic then maybe you'd do this upstream and save a step.
CREATE STREAM sensor_data_raw (city VARCHAR, temperature DOUBLE, sensorId VARCHAR)
WITH (KAFKA_TOPIC='ksqldb_topic_01', VALUE_FORMAT='JSON');
Repartition the data based on the composite key.
SET 'auto.offset.reset' = 'earliest';
CREATE STREAM sensor_data_repartitioned WITH (VALUE_FORMAT='AVRO') AS
SELECT *
FROM sensor_data_raw
PARTITION BY city+sensorId;
Two things to note:
I'm taking the opportunity to reserialise into Avro - if you'd rather keep JSON throughout then just omit the WITH (VALUE_FORMAT='AVRO') clause.
When the data is repartitioned the ordering guarantees are lost, so in theory you may end up with events out of order after this.
At this point we can inspect the transformed topic:
ksql> PRINT SENSOR_DATA_REPARTITIONED FROM BEGINNING LIMIT 5;
Format:AVRO
1/24/20 9:55:54 AM UTC, Madridsensor03, {"CITY": "Madrid", "TEMPERATURE": 20.0, "SENSORID": "sensor03"}
1/24/20 9:55:54 AM UTC, Madridsensor03, {"CITY": "Madrid", "TEMPERATURE": 5.0, "SENSORID": "sensor03"}
1/24/20 9:55:54 AM UTC, Sevillasensor01, {"CITY": "Sevilla", "TEMPERATURE": 10.0, "SENSORID": "sensor01"}
1/24/20 9:55:54 AM UTC, Sevillasensor01, {"CITY": "Sevilla", "TEMPERATURE": 15.0, "SENSORID": "sensor01"}
1/24/20 9:55:54 AM UTC, Sevillasensor03, {"CITY": "Sevilla", "TEMPERATURE": 20.0, "SENSORID": "sensor03"}
Note that the key in the Kafka message (the second field, after the timestamp) is now set correctly, compared to the original data that had no key:
ksql> PRINT ksqldb_topic_01 FROM BEGINNING LIMIT 5;
Format:JSON
{"ROWTIME":1579859380123,"ROWKEY":"null","city":"Madrid","temperature":20,"sensorId":"sensor03"}
{"ROWTIME":1579859380123,"ROWKEY":"null","city":"Madrid","temperature":5,"sensorId":"sensor03"}
{"ROWTIME":1579859380123,"ROWKEY":"null","city":"Sevilla","temperature":10,"sensorId":"sensor01"}
{"ROWTIME":1579859380123,"ROWKEY":"null","city":"Sevilla","temperature":15,"sensorId":"sensor01"}
{"ROWTIME":1579859380123,"ROWKEY":"null","city":"Sevilla","temperature":20,"sensorId":"sensor03"}
Now we can declare a table over the repartitioned data. Since I'm using Avro now I don't have to reenter the schema. If I was using JSON I would need to enter it again as part of this DDL.
CREATE TABLE ultimo_resgistro WITH (KAFKA_TOPIC='SENSOR_DATA_REPARTITIONED', VALUE_FORMAT='AVRO');
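For comparison, a hedged sketch of the JSON variant mentioned above, where the schema has to be restated because the JSON records carry none:
CREATE TABLE ultimo_resgistro (city VARCHAR, temperature DOUBLE, sensorId VARCHAR)
  WITH (KAFKA_TOPIC='SENSOR_DATA_REPARTITIONED', VALUE_FORMAT='JSON');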
The table's key is implicitly taken from the ROWKEY, which is the key of the Kafka message.
ksql> SELECT ROWKEY, CITY, SENSORID, TEMPERATURE FROM ULTIMO_RESGISTRO EMIT CHANGES;
+------------------+----------+----------+-------------+
|ROWKEY |CITY |SENSORID |TEMPERATURE |
+------------------+----------+----------+-------------+
|Madridsensor03 |Madrid |sensor03 |5.0 |
|Sevillasensor03 |Sevilla |sensor03 |20.0 |
|Sevillasensor01 |Sevilla |sensor01 |5.0 |
|Sevillasensor02 |Sevilla |sensor02 |5.0 |
|Valenciasensor02 |Valencia |sensor02 |15.0 |
|Valenciasensor03 |Valencia |sensor03 |15.0 |
If you want to take advantage of pull queries (in order to get the latest value) then you need to go and upvote (or contribute a PR 😁) this issue.
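As a side note (not part of the original answer): ksqlDB releases newer than 0.6.0 added the LATEST_BY_OFFSET aggregate, which is another way to materialize the latest value per composite key; a hedged sketch, assuming such a version and the sensor_data_raw stream from above (names are illustrative):
-- Requires a ksqlDB release that ships LATEST_BY_OFFSET (roughly 0.8+).
CREATE TABLE ultimo_registro_latest AS
  SELECT city + sensorId AS city_sensor,
         LATEST_BY_OFFSET(city)        AS city,
         LATEST_BY_OFFSET(sensorId)    AS sensor_id,
         LATEST_BY_OFFSET(temperature) AS temperature
  FROM sensor_data_raw
  GROUP BY city + sensorId;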
