Save last record according to a composite key (ksqlDB 0.6.0)

I have a Kafka topic (ksqldb_topic_01) with the following data flowing through it:
% Reached end of topic ksqldb_topic_01 [0] at offset 213
{"city":"Sevilla","temperature":20,"sensorId":"sensor03"}
% Reached end of topic ksqldb_topic_01 [0] at offset 214
{"city":"Madrid","temperature":5,"sensorId":"sensor03"}
% Reached end of topic ksqldb_topic_01 [0] at offset 215
{"city":"Sevilla","temperature":10,"sensorId":"sensor01"}
% Reached end of topic ksqldb_topic_01 [0] at offset 216
{"city":"Valencia","temperature":15,"sensorId":"sensor03"}
% Reached end of topic ksqldb_topic_01 [0] at offset 217
{"city":"Sevilla","temperature":15,"sensorId":"sensor01"}
% Reached end of topic ksqldb_topic_01 [0] at offset 218
{"city":"Madrid","temperature":20,"sensorId":"sensor03"}
% Reached end of topic ksqldb_topic_01 [0] at offset 219
{"city":"Valencia","temperature":15,"sensorId":"sensor02"}
% Reached end of topic ksqldb_topic_01 [0] at offset 220
{"city":"Sevilla","temperature":5,"sensorId":"sensor02"}
% Reached end of topic ksqldb_topic_01 [0] at offset 221
{"city":"Sevilla","temperature":5,"sensorId":"sensor01"}
% Reached end of topic ksqldb_topic_01 [0] at offset 222
I want to save in a table the last value that arrives in the topic, for each combination of city and sensorId.
In my ksqlDB I create the following table:
CREATE TABLE ultimo_resgistro(city VARCHAR,sensorId VARCHAR,temperature INTEGER) WITH (KAFKA_TOPIC='ksqldb_topic_01', VALUE_FORMAT='json',KEY = 'sensorId,city');
DESCRIBE EXTENDED ULTIMO_RESGISTRO;
Name : ULTIMO_RESGISTRO
Type : TABLE
Key field : SENSORID
Key format : STRING
Timestamp field : Not set - using <ROWTIME>
Value format : JSON
Kafka topic : ksqldb_topic_01 (partitions: 1, replication: 1)
Field | Type
-----------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
CITY | VARCHAR(STRING)
SENSORID | VARCHAR(STRING)
TEMPERATURE | INTEGER
-----------------------------------------
To see whether the data is being processed, I run:
select * from ultimo_resgistro emit changes;
+------------------+------------------+------------------+------------------+------------------+
|ROWTIME |ROWKEY |CITY |SENSORID |TEMPERATURE |
+------------------+------------------+------------------+------------------+------------------+
key cannot be null
Query terminated

The problem is that you need to set the key of the Kafka message correctly. You also cannot specify two fields in the KEY clause. Read more about this here
Here's an example of how to do it.
First up, load test data:
kafkacat -b kafka-1:39092 -P -t ksqldb_topic_01 <<EOF
{"city":"Madrid","temperature":20,"sensorId":"sensor03"}
{"city":"Madrid","temperature":5,"sensorId":"sensor03"}
{"city":"Sevilla","temperature":10,"sensorId":"sensor01"}
{"city":"Sevilla","temperature":15,"sensorId":"sensor01"}
{"city":"Sevilla","temperature":20,"sensorId":"sensor03"}
{"city":"Sevilla","temperature":5,"sensorId":"sensor01"}
{"city":"Sevilla","temperature":5,"sensorId":"sensor02"}
{"city":"Valencia","temperature":15,"sensorId":"sensor02"}
{"city":"Valencia","temperature":15,"sensorId":"sensor03"}
EOF
Now in ksqlDB declare the schema over the topic - as a stream, because we need to repartition the data to add a key. If you control the producer to the topic then maybe you'd do this upstream and save a step.
CREATE STREAM sensor_data_raw (city VARCHAR, temperature DOUBLE, sensorId VARCHAR)
WITH (KAFKA_TOPIC='ksqldb_topic_01', VALUE_FORMAT='JSON');
Repartition the data based on the composite key.
SET 'auto.offset.reset' = 'earliest';
CREATE STREAM sensor_data_repartitioned WITH (VALUE_FORMAT='AVRO') AS
SELECT *
FROM sensor_data_raw
PARTITION BY city+sensorId;
Two things to note:
I'm taking the opportunity to reserialise into Avro - if you'd rather keep JSON throughout then just omit the WITH (VALUE_FORMAT='AVRO') clause.
When the data is repartitioned the ordering guarantees are lost, so in theory you may end up with events out of order after this.
At this point we can inspect the transformed topic:
ksql> PRINT SENSOR_DATA_REPARTITIONED FROM BEGINNING LIMIT 5;
Format:AVRO
1/24/20 9:55:54 AM UTC, Madridsensor03, {"CITY": "Madrid", "TEMPERATURE": 20.0, "SENSORID": "sensor03"}
1/24/20 9:55:54 AM UTC, Madridsensor03, {"CITY": "Madrid", "TEMPERATURE": 5.0, "SENSORID": "sensor03"}
1/24/20 9:55:54 AM UTC, Sevillasensor01, {"CITY": "Sevilla", "TEMPERATURE": 10.0, "SENSORID": "sensor01"}
1/24/20 9:55:54 AM UTC, Sevillasensor01, {"CITY": "Sevilla", "TEMPERATURE": 15.0, "SENSORID": "sensor01"}
1/24/20 9:55:54 AM UTC, Sevillasensor03, {"CITY": "Sevilla", "TEMPERATURE": 20.0, "SENSORID": "sensor03"}
Note that the key in the Kafka message (the second field, after the timestamp) is now set correctly, compared to the original data that had no key:
ksql> PRINT ksqldb_topic_01 FROM BEGINNING LIMIT 5;
Format:JSON
{"ROWTIME":1579859380123,"ROWKEY":"null","city":"Madrid","temperature":20,"sensorId":"sensor03"}
{"ROWTIME":1579859380123,"ROWKEY":"null","city":"Madrid","temperature":5,"sensorId":"sensor03"}
{"ROWTIME":1579859380123,"ROWKEY":"null","city":"Sevilla","temperature":10,"sensorId":"sensor01"}
{"ROWTIME":1579859380123,"ROWKEY":"null","city":"Sevilla","temperature":15,"sensorId":"sensor01"}
{"ROWTIME":1579859380123,"ROWKEY":"null","city":"Sevilla","temperature":20,"sensorId":"sensor03"}
Now we can declare a table over the repartitioned data. Since I'm using Avro I don't have to re-enter the schema. If I were using JSON I would need to enter it again as part of this DDL.
CREATE TABLE ultimo_resgistro WITH (KAFKA_TOPIC='SENSOR_DATA_REPARTITIONED', VALUE_FORMAT='AVRO');
The table's key is implicitly taken from the ROWKEY, which is the key of the Kafka message.
ksql> SELECT ROWKEY, CITY, SENSORID, TEMPERATURE FROM ULTIMO_RESGISTRO EMIT CHANGES;
+------------------+----------+----------+-------------+
|ROWKEY |CITY |SENSORID |TEMPERATURE |
+------------------+----------+----------+-------------+
|Madridsensor03 |Madrid |sensor03 |5.0 |
|Sevillasensor03 |Sevilla |sensor03 |20.0 |
|Sevillasensor01 |Sevilla |sensor01 |5.0 |
|Sevillasensor02 |Sevilla |sensor02 |5.0 |
|Valenciasensor02 |Valencia |sensor02 |15.0 |
|Valenciasensor03 |Valencia |sensor03 |15.0 |
If you want to take advantage of pull queries (in order to get the latest value) then you need to go and upvote (or contribute a PR 😁) this issue.

Related

DataFlowRunner + Beam in streaming mode with a SideInput AsDict hangs

I have a simple graph that reads from a pubsub message (currently just a single string key), creates a very short window, generates 3 integers that use this key via a beam.ParDo, and a simple Map that creates a single "config" with this as a key.
Ultimately, there are 2 PCollections:
items: [('key', 0), ('key', 1), ...]
infos: [('key', 'the value is key')]
I want a final beam.Map over items that uses infos as a dictionary side input so I can look up the value in the dictionary.
Using the LocalRunner, the final print works with the side input.
On DataFlow the first two steps print, but the final Map with the side input is never called, presumably because it somehow ends up in an unbounded window (despite the earlier window function).
I am using runner_v2, dataflow prime, and streaming engine.
p = beam.Pipeline(options=pipeline_options)

pubsub_message = (
    p
    | beam.io.gcp.pubsub.ReadFromPubSub(
        subscription='projects/myproject/testsubscription')
    | 'SourceWindow' >> beam.WindowInto(
        beam.transforms.window.FixedWindows(1e-6),
        trigger=beam.transforms.trigger.Repeatedly(beam.transforms.trigger.AfterCount(1)),
        accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING))


def _create_items(pubsub_key: bytes) -> Iterable[tuple[str, int]]:
    for i in range(3):
        yield pubsub_key.decode(), i


def _create_info(pubsub_key: bytes) -> tuple[str, str]:
    return pubsub_key.decode(), f'the value is {pubsub_key.decode()}'


items = pubsub_message | 'CreateItems' >> beam.ParDo(_create_items) | beam.Reshuffle()
info = pubsub_message | 'CreateInfo' >> beam.Map(_create_info)


def _print_item(keyed_item: tuple[str, int], info_dict: dict[str, str]) -> None:
    key, _ = keyed_item
    log(key + '::' + info_dict[key])


_ = items | 'MapWithSideInput' >> beam.Map(_print_item, info_dict=beam.pvalue.AsDict(info))
Here is the output in local runner:
Creating item 0
Creating item 1
Creating item 2
Creating info b'key'
key::the value is key
key::the value is key
key::the value is key
Here is the DataFlow graph:
I've tried various windowing functions over the AsDict, but I can never get it to be exactly the same window as my input.
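For illustration, one attempt along those lines might look roughly like this (a sketch only: it reuses the SourceWindow settings on the info branch before wrapping it in AsDict; the variable windowed_info and the step names 'WindowInfo' and 'MapWithWindowedSideInput' are illustrative, not the exact code tried):

# Sketch: window the side-input branch with the same FixedWindows/trigger as
# the main input, then use the windowed collection as the AsDict side input.
windowed_info = (
    info
    | 'WindowInfo' >> beam.WindowInto(
        beam.transforms.window.FixedWindows(1e-6),
        trigger=beam.transforms.trigger.Repeatedly(
            beam.transforms.trigger.AfterCount(1)),
        accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING))

_ = items | 'MapWithWindowedSideInput' >> beam.Map(
    _print_item, info_dict=beam.pvalue.AsDict(windowed_info))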
Thoughts on what I might be doing wrong here?

Error while reading data, error message: JSON table encountered too many errors, giving up. Rows

I have two files and am doing an inner join on them using CoGroupByKey in Apache Beam.
When I write the rows to BigQuery, it gives me the following error:
RuntimeError: BigQuery job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_614_c4a563c648634e9dbbf7be3a56578b6d_2f196decc8984a0d83dee92e19054ffb failed. Error Result: <ErrorProto
location: 'gs://dataflow4bigquery/temp/bq_load/06bfafaa9dbb47338ad4f3a9914279fe/dotted-transit-351803.test_dataflow.inner_join/f714c1ac-c234-4a37-bf51-c725a969347a'
message: 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.'
reason: 'invalid'> [while running 'WriteToBigQuery/BigQueryBatchFileLoads/WaitForDestinationLoadJobs']
-----------------code-----------------------
from apache_beam.io.gcp.internal.clients import bigquery
import apache_beam as beam


def retTuple(element):
    thisTuple = element.split(',')
    return (thisTuple[0], thisTuple[1:])


def jstr(cstr):
    import datetime
    left_dict = cstr[1]['dep_data']
    right_dict = cstr[1]['loc_data']
    for i in left_dict:
        for j in right_dict:
            id, name, rank, dept, dob, loc, city = ([cstr[0]] + i + j)
            json_str = {
                "id": id,
                "name": name,
                "rank": rank,
                "dept": dept,
                "dob": datetime.datetime.strptime(dob, "%d-%m-%Y").strftime("%Y-%m-%d").strip("'"),
                "loc": loc,
                "city": city
            }
    return json_str


table_spec = 'dotted-transit-351803:test_dataflow.inner_join'
table_schema = 'id:INTEGER,name:STRING,rank:INTEGER,dept:STRING,dob:STRING,loc:INTEGER,city:STRING'
gcs = 'gs://dataflow4bigquery/temp/'

p1 = beam.Pipeline()

# Key each input row by its first field so the two files can be co-grouped.
dep_rows = (
    p1
    | "Reading File 1" >> beam.io.ReadFromText('dept_data.txt')
    | 'Pair each employee with key' >> beam.Map(retTuple)  # {149633CM : [Marco,10,Accounts,1-01-2019]}
)

loc_rows = (
    p1
    | "Reading File 2" >> beam.io.ReadFromText('location.txt')
    | 'Pair each loc with key' >> beam.Map(retTuple)  # {149633CM : [9876843261,New York]}
)

results = (
    {'dep_data': dep_rows, 'loc_data': loc_rows}
    | beam.CoGroupByKey()
    | beam.Map(jstr)
    | beam.io.WriteToBigQuery(
        custom_gcs_temp_location=gcs,
        table=table_spec,
        schema=table_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        additional_bq_parameters={'timePartitioning': {'type': 'DAY'}}
    )
)

p1.run().wait_until_finish()
I am running it on GCP using the Dataflow runner.
When I print the json_str string, the output is valid JSON.
E.g.:
{'id': '149633CM', 'name': 'Marco', 'rank': '10', 'dept': 'Accounts', 'dob': '2019-01-31', 'loc': '9204232778', 'city': 'New York'}
{'id': '212539MU', 'name': 'Rebekah', 'rank': '10', 'dept': 'Accounts', 'dob': '2019-01-31', 'loc': '9995440673', 'city': 'Denver'}
The schema I have defined is also correct.
But I am getting that error when loading it into BigQuery.
After doing some research, I finally solved it.
It was a schema error.
The Id column has values like 149633CM.
I had given the data type of Id as INTEGER, but when I tried to load the JSON with bq using --autodetect, bq marked the data type of Id as STRING.
After that, I changed the data type of the Id column to STRING in the schema in my code.
And it worked: the table was created and loaded.
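For reference, the only change needed in the pipeline above is the schema string (a minimal sketch; everything else in the WriteToBigQuery call stays as shown earlier):

# Corrected schema: id is STRING, because values such as '149633CM'
# cannot be loaded into an INTEGER column.
table_schema = 'id:STRING,name:STRING,rank:INTEGER,dept:STRING,dob:STRING,loc:INTEGER,city:STRING'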
But there is one thing I don't get: if the first 6 characters of the Id column are numbers, why doesn't INTEGER work while STRING does?
Because the data type is applied to the whole field value, not only the first 6 characters. If you drop the last 2 characters, you can use INTEGER.
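A quick Python illustration of that point (not part of the pipeline):

int('149633')    # 149633 - the first 6 characters alone are a valid integer
int('149633CM')  # raises ValueError - the type applies to the whole value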

grep: extract content between prefix and suffix

I have a file whose content looks like this:
Listening for transport dt_socket at address: 8000
------------------------------------------------------------
🔥 ^[[1m HAPI FHIR^[[22m 5.4.0 - Command Line Tool
------------------------------------------------------------
Process ID : 21719#psgd
Max configured JVM memory (Xmx) : 3.2GB
Detected Java version : 11.0.7
------------------------------------------------------------
^[[32m2021-07-01^[[0;39m ^[[1;32m12:27:40.79^[[0;39m ^[[37m[main]^[[0;39m ^[[37mWARN ^[[0;39m ^[[1;34mo.f.c.i.s.c.ClassPathScanner^[[0;39m ^[[1;37mUnable to resolve location classpath:db/migration. Note this warning will become an error in Flyway 7.
^[[0;39m^[[32m2021-07-01^[[0;39m ^[[1;32m12:27:42.641^[[0;39m ^[[37m[main]^[[0;39m ^[[37mWARN ^[[0;39m ^[[1;34mo.f.c.i.s.c.ClassPathScanner^[[0;39m ^[[1;37mUnable to resolve location classpath:db/migration. Note this warning will become an error in Flyway 7.
^[[0;39m^[[32m2021-07-01^[[0;39m ^[[1;32m12:27:44.693^[[0;39m ^[[37m[main]^[[0;39m ^[[37mINFO ^[[0;39m ^[[1;34mc.u.f.j.m.t.InitializeSchemaTask^[[0;39m ^[[1;37m3_3_0.20180115.0: Initializing ORACLE_12C schema for HAPI FHIR
^[[0;39m^[[32m2021-07-01^[[0;39m ^[[1;32m12:27:44.848^[[0;39m ^[[37m[main]^[[0;39m ^[[37mINFO ^[[0;39m ^[[1;34mc.u.f.j.m.t.BaseTask^[[0;39m ^[[1;37m3_3_0.20180115.0: SQL "create sequence SEQ_BLKEXCOL_PID start with 1 increment by 50" returned 0
^[[0;39m^[[32m2021-07-01^[[0;39m ^[[1;32m12:27:44.918^[[0;39m ^[[37m[main]^[[0;39m ^[[37mINFO ^[[0;39m ^[[1;34mc.u.f.j.m.t.BaseTask^[[0;39m ^[[1;37m3_3_0.20180115.0: SQL "
create sequence SEQ_BLKEXCOLFILE_PID start with 1 increment by 50" returned 0
^[[0;39m^[[32m2021-07-01^[[0;39m ^[[1;32m12:27:47.573^[[0;39m ^[[37m[main]^[[0;39m ^[[37mINFO ^[[0;39m ^[[1;34mc.u.f.j.m.t.BaseTask^[[0;39m ^[[1;37m3_3_0.20180115.0: SQL "
create table HFJ_BINARY_STORAGE_BLOB (
BLOB_ID varchar2(200 char) not null,
BLOB_DATA blob not null,
CONTENT_TYPE varchar2(100 char) not null,
BLOB_HASH varchar2(128 char),
PUBLISHED_DATE timestamp not null,
RESOURCE_ID varchar2(100 char) not null,
BLOB_SIZE number(10,0),
primary key (BLOB_ID)
)" returned 0
I need to extract only the content between SQL " and " returned 0, trimming the leading whitespace.
Any ideas?
I've tried to reduce the problem using:
$ echo 'sdf SQL" sdf sdf" returned 0' | grep 's/SQL"\(.*\)" returned 0/\1/' -
But it returns nothing.
My expected output is:
create sequence SEQ_BLKEXCOL_PID start with 1 increment by 50;
create sequence SEQ_BLKEXCOLFILE_PID start with 1 increment by 50;
create table HFJ_BINARY_STORAGE_BLOB (
BLOB_ID varchar2(200 char) not null,
BLOB_DATA blob not null,
CONTENT_TYPE varchar2(100 char) not null,
BLOB_HASH varchar2(128 char),
PUBLISHED_DATE timestamp not null,
RESOURCE_ID varchar2(100 char) not null,
BLOB_SIZE number(10,0),
primary key (BLOB_ID)
);
I've tried to run:
cat test.log | sed -E 's/.* SQL"(.*)" returned 0/\1/'
but it returns the whole file content...
Using awk, it returns empty:
$ awk -v RS='SQL "[[:space:]]+?\n\n+.*returned 0' '
RT{
gsub(/^SQL "\n+|\n+$/,"",RT)
sub(/" returned 0[[:space:]]+?\n*$/,"",RT)
print RT";"
}
' test.log
This can be done using a custom RS in gnu-awk that splits the data on the SQL "..." text block; then, inside the action block, the text between the quotes is extracted without the leading space.
awk -v RS=' SQL "[^"]+"' 'RT {
gsub(/^[^"]*"[[:space:]]*|"[^"]*$/, "", RT); print RT ";"}' file.sql
create sequence SEQ_BLKEXCOL_PID start with 1 increment by 50;
create sequence SEQ_BLKEXCOLFILE_PID start with 1 increment by 50;
create table HFJ_BINARY_STORAGE_BLOB (
BLOB_ID varchar2(200 char) not null,
BLOB_DATA blob not null,
CONTENT_TYPE varchar2(100 char) not null,
BLOB_HASH varchar2(128 char),
PUBLISHED_DATE timestamp not null,
RESOURCE_ID varchar2(100 char) not null,
BLOB_SIZE number(10,0),
primary key (BLOB_ID)
);
With GNU awk, and with your shown samples, please try the following code.
awk -v RS='SQL "[[:space:]]*\n\n+.*returned 0' '
RT{
gsub(/^SQL "\n+|\n+$/,"",RT)
sub(/" returned 0[[:space:]]*\n*$/,"",RT)
print RT";"
}
' Input_file
Explanation: in simple terms, we set RS to SQL "[[:space:]]*\n\n+.*returned 0 for the awk program, then remove the unneeded parts (the leading SQL " with its newlines, and the trailing returned 0) from the value before printing it.
The regex works as follows: match SQL followed by a space and ", then optional whitespace, then one or more newlines, through to returned 0.
If you have gnu grep then you can use this PCRE regex:
grep -oPz ' SQL "\K[^"]+' file.sql
create sequence SEQ_BLKEXCOL_PID start with 1 increment by 50
create sequence SEQ_BLKEXCOLFILE_PID start with 1 increment by 50
create table HFJ_BINARY_STORAGE_BLOB (
BLOB_ID varchar2(200 char) not null,
BLOB_DATA blob not null,
CONTENT_TYPE varchar2(100 char) not null,
BLOB_HASH varchar2(128 char),
PUBLISHED_DATE timestamp not null,
RESOURCE_ID varchar2(100 char) not null,
BLOB_SIZE number(10,0),
primary key (BLOB_ID)
)
Explanation:
' SQL ": search for the literal SQL " text
\K: reset the matched info (so the prefix is not included in the output)
[^"]+: match 1+ of any characters that are not "
To get the formatting as desired (per the comments), use this grep + sed (GNU) solution:
grep -oPzZ ' SQL "\K[^"]+' file.sql |
sed -E '$s/$/\n/; s/\x0/;/; s/^[[:blank:]]+//'
create sequence SEQ_BLKEXCOL_PID start with 1 increment by 50;
create sequence SEQ_BLKEXCOLFILE_PID start with 1 increment by 50;
create table HFJ_BINARY_STORAGE_BLOB (
BLOB_ID varchar2(200 char) not null,
BLOB_DATA blob not null,
CONTENT_TYPE varchar2(100 char) not null,
BLOB_HASH varchar2(128 char),
PUBLISHED_DATE timestamp not null,
RESOURCE_ID varchar2(100 char) not null,
BLOB_SIZE number(10,0),
primary key (BLOB_ID)
);
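If you later process these logs from Python rather than the shell, the same extraction can be sketched with the re module (a hypothetical helper, not part of the grep/awk answers above; it assumes the log is in test.log):

import re

def extract_sql(log_text):
    """Return the statements found between ' SQL "' and '" returned 0'."""
    blocks = re.findall(r' SQL "(.*?)" returned 0', log_text, flags=re.DOTALL)
    # Trim leading whitespace on each line and terminate each statement with ';'.
    return ['\n'.join(line.lstrip() for line in b.strip().splitlines()) + ';'
            for b in blocks]

with open('test.log') as f:
    for statement in extract_sql(f.read()):
        print(statement)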

Dataflow stream python windowing

I am new to using Dataflow. I have the following logic:
An event is added to Pub/Sub.
Dataflow reads Pub/Sub and gets the event.
From the event, I look into MySQL to find the segments this event relates to; a list of relations is returned by this step. These segments are independent of one another.
Each segment can be split into two MySQL result tables, for email and mobile, and these are independent as well.
Each segment has rules, which can number 1 to n. I would like to process this step in parallel and collect all the results. I have tried to use windows, but I am not sure how to write the logic so that, when I get the combined results from all rules inside one segment, all of them are collected in a final function that writes the result to MySQL depending on the rule results (boolean).
Here is what I have so far:
testP = beam.Pipeline(options=options)

ReadData = (
    testP
    | 'ReadData' >> beam.io.ReadFromPubSub(
        subscription=str(options.pubsubsubscriber.get())).with_output_types(bytes)
    | 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
    | 'GetSegments' >> beam.ParDo(getsegments(options))
)

processEmails = (
    ReadData
    | 'GetSubscribersWithRulesForEmails' >> beam.ParDo(GetSubscribersWithRules(options, 'email'))
    | 'ProcessSubscribersSegmentsForEmails' >> beam.ParDo(ProcessSubscribersSegments(options, 'email'))
)

processMobiles = (
    ReadData
    | 'GetSubscribersWithRulesForMobiles' >> beam.ParDo(GetSubscribersWithRules(options, 'mobile'))
    | 'ProcessSubscribersSegmentsForMobiles' >> beam.ParDo(ProcessSubscribersSegments(options, 'mobile'))
)

# for sake of testing only window for email is written
windowThis = (
    processEmails
    | beam.WindowInto(
        beam.window.FixedWindows(1),
        trigger=beam.transforms.trigger.Repeatedly(
            beam.transforms.trigger.AfterProcessingTime(1 * 10)),
        accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING)
    | beam.CombinePerKey(beam.combiners.ToListCombineFn())
    | beam.ParDo(print_windows)
)
In this case, because all of your elements have the exact same timestamp, I would use their message ID and their timestamp to group them with session windows. It would be something like this:
testP = beam.Pipeline(options=options)

ReadData = (
    testP
    | 'ReadData' >> beam.io.ReadFromPubSub(
        subscription=str(options.pubsubsubscriber.get())).with_output_types(bytes)
    | 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
    | 'GetSegments' >> beam.ParDo(getsegments(options))
)

# At this point, ReadData contains (key, value) pairs with a timestamp.
# (Now we perform all of the processing)
processEmails = (ReadData | ....)
processMobiles = (ReadData | .....)

# Now we window by sessions with a 1-second gap. This is okay because all of
# the elements for any given key have the exact same timestamp.
windowThis = (
    processEmails
    | beam.WindowInto(beam.window.Sessions(1))  # Default trigger is fine
    | beam.CombinePerKey(beam.combiners.ToListCombineFn())
    | beam.ParDo(print_windows)
)

AQL different results from stream UDF depending on output style (table, json)

I'm trying to create an aggregation (map | reduce) with a UDF, but something is wrong at the very beginning. In Aerospike I have a set with bin 'u' (secondary index) and bin 'v', which is a list of objects (auctions with transaction lists and other auction data), and I have a stream UDF to aggregate the internal structure of 'v':
function trans_sum_by_years(s)
    local function transform(rec)
        local l = map()
        local x = map()
        local trans, auctions = 0, 0
        for i in list.iterator(rec['v'] or list()) do
            auctions = auctions + 1
            for t in list.iterator(i['t'] or list()) do
                trans = trans + 1
                date = os.date("*t", t['ts'])
                if l[date['year']] ~= nil then
                    l[date['year']] = l[date['year']] + t['price'] * t['qty']
                else
                    l[date['year']] = t['price'] * t['qty']
                end
            end
        end
        x.auctions = auctions
        x.trans = trans
        x.v = l
        return x
    end
    return s : map(transform)
end
The problem is that the output is very different depending on whether output is set to table or json. In the first case everything seems OK:
{"trans":594, "auctions":15, "v":{2010:1131030}}
{"trans":468, "auctions":68, "v":{2011:1472976, 2012:5188}}
......
In the second case I get an empty object from the internal record aggregation.
{
  "trans_sum_b...": {
    "trans": 389,
    "auctions": 89,
    "v": {}
  }
},
{
  "trans_sum_b...": {
    "trans": 542,
    "auctions": 30,
    "v": {}
  }
}
.....
I prefer the json output and wasted a couple of hours trying to find out why I get an empty 'v' field, without success. So my question is "what the hell is going on" ;-) If my code is correct, what is wrong with the json output such that I don't see the results? If my code is wrong, why is it wrong, and why does the table output give me what I need?
@user1875438 Your code is correct. It seems there is a bug in aql.
My result is the same as yours: the v field is empty when using json mode.
I used tcpdump to capture the responses of aerospike-server when running these two commands and found that the responses are the same, so I think it's very likely there is a bug in the aql tool.
159 0x0050: 0001 0000 0027 0113 0007 5355 4343 4553 .....'....SUCCES
160 0x0060: 5383 a603 7472 616e 7301 a903 6175 6374 S...trans...auct
161 0x0070: 696f 6e73 01a2 0376 81cd 07ce 01 ions...v.....
162 01:57:38.255065 IP localhost.hbci > localhost.57731: Flags [P.], seq 98:128, ack 144, win 42853, options [nop,nop,TS val 976630236 ecr 976630223], length 30
163 0x0000: 4500 0052 55f8 4000 4006 0000 7f00 0001 E..RU.#.#.......
I just posted an issue here.
The answer is simple as hell. But I'm new to Aerospike/Lua and didn't trust my knowledge, so I searched for the error everywhere except the AQL/UDF area. The problem is more fundamental and comes down to the JSON specification itself.
Keys in JSON have to be strings! So using tostring(date['year']) as the map key (i.e. l[tostring(date['year'])] = ...) solves the problem.
The other question is whether this is a bug or a feature :-) If Aerospike's map type allows integer keys, should there be an automatic key conversion from integer to string to satisfy the JSON specification or not? IMHO there should be, but some people will probably disagree, claiming that the map type is not meant for integer keys...
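As an aside, the same constraint shows up in any JSON library; for example, Python's json module silently converts integer map keys to strings when serialising:

import json

# JSON object keys must be strings, so the integer key 2010 becomes "2010".
print(json.dumps({2010: 1131030}))   # {"2010": 1131030}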
