Creating an Avro schema for an array with multiple record types?

I am creating an Avro schema for a JSON payload that appears to contain an array of multiple object types. I'm not sure exactly how to represent this in the schema. The key in question is content:
{
"id": "channel-id",
"name": "My Channel with a New Title",
"description": "Herpy me derpy merpus herpsum ner berp berps derp ter tee",
"privacyLevel": "<private|org>",
"planId": "some-plan-id",
"owner": "a-user-handle",
"curators": [
"user-handle-1",
"user-handle-2"
],
"members": 5,
"content": [
{
"id": "docker",
"slug": "docker",
"index": 1,
"type": "path"
},
{
"id": "such-linkage",
"slug": "such-linkage",
"index": 2,
"type": "external-link",
"details": {
"url": "http://some-dank-link.com",
"title": "My Dank Link",
"contentType": "External Link",
"level": "Beginner",
"duration": "PT34293H33M9S"
}
},
{
"id": "21f1e812-b10a-40df-8b52-3a1d05fc215c",
"slug": "windows-azure-storage-in-depth",
"index": 3,
"type": "course"
},
{
"id": "7c346c05-6416-42dd-80b2-d5e758de7926",
"slug": "7c346c05-6416-42dd-80b2-d5e758de7926",
"index": 4,
"type": "project"
}
],
"imageUrls": ["https://url/to/an/image", "https://url/to/another/image"],
"analyticsEnabled": true,
"orgDiscoverable": false,
"createdDate": "2015-12-31T01:23:45+00:00",
"archiveDate": "2015-12-31T01:23:45+00:00",
"messagePublishedAt": "2015-12-31T01:23:45+00:00"
}

If you are asking whether it is possible to create an array with different kinds of records, it is: Avro supports this through unions. It would look like this:
{
  "name": "myRecord",
  "type": "record",
  "fields": [
    {
      "name": "myArrayWithMultiplesTypes",
      "type": {
        "type": "array",
        "items": [
          {
            "name": "typeOne",
            "type": "record",
            "fields": [
              {"name": "name", "type": "string"}
            ]
          },
          {
            "name": "typeTwo",
            "type": "record",
            "fields": [
              {"name": "id", "type": "int"}
            ]
          }
        ]
      }
    }
  ]
}

If you already have the record types defined elsewhere, then it could look like this:
{
  "name": "mulitplePossibleTypes",
  "type": [
    "null",
    {
      "type": "array",
      "items": [
        "com.xyz.kola.cloud.events.itemmanager.Part",
        "com.xyz.kola.cloud.events.itemmanager.Document",
        "com.xyz.kola.cloud.events.itemmanager.DigitalModel",
        "com.xyz.kola.cloud.events.itemmanager.Interface"
      ]
    }
  ]
},
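To sanity-check either variant from Python, here is a minimal sketch assuming the fastavro library (not part of the original answer); it uses the typeOne/typeTwo schema from the first example and lets fastavro pick the matching union branch for each array item:

# Sketch: serialize an array whose items are a union of two record types with fastavro.
import io

import fastavro

schema = fastavro.parse_schema({
    "name": "myRecord",
    "type": "record",
    "fields": [
        {
            "name": "myArrayWithMultiplesTypes",
            "type": {
                "type": "array",
                "items": [
                    {"name": "typeOne", "type": "record",
                     "fields": [{"name": "name", "type": "string"}]},
                    {"name": "typeTwo", "type": "record",
                     "fields": [{"name": "id", "type": "int"}]},
                ],
            },
        }
    ],
})

record = {
    "myArrayWithMultiplesTypes": [
        {"name": "docker"},  # resolves against typeOne
        {"id": 42},          # resolves against typeTwo
    ]
}

buf = io.BytesIO()
fastavro.writer(buf, schema, [record])  # fastavro picks the matching union branch per item
buf.seek(0)
print(list(fastavro.reader(buf)))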

Error creating a kafka message to producer - Expected start-union. Got VALUE_STRING

I have the following Avro schema (avsc):
{
"namespace": "de.morris.audit",
"type": "record",
"name": "AuditDataChangemorris",
"fields": [
{"name": "employeeID", "type": "string"},
{"name": "employeeNumber", "type": ["null", "string"], "default": null},
{"name": "serialNumbers", "type": [ "null", {"type": "array", "items": "string"}]},
{"name": "correlationId", "type": "string"},
{"name": "timestamp", "type": "long", "logicalType": "timestamp-millis"},
{"name": "employmentscreening","type":{"type": "enum", "name": "employmentscreening", "symbols": ["NO","YES"]}},
{"name": "vouchercodes","type": ["null",
{
"type": "array",
"items": {
"name": "Vouchercodes",
"type": "record",
"fields": [
{"name": "voucherName","type": ["null","string"], "default": null},
{"name": "authocode","type": ["null","string"], "default": null}
]
}
}], "default": null}
]
}
I tried to create the following sample data in JSON format for the Kafka consumer, based on the above avsc:
{
"employeeID": "qtete46524",
"employeeNumber": {
"string": "custnumber9813"
},
"serialNumbers": {
"type": "array",
"items": ["363536623","5846373733"]
},
"correlationId": "corr-656532443",
"timestamp": 1476538955719,
"employmentscreening": "NO",
"vouchercodes": [
{
"voucherName": "skygo",
"authocode": "A238472ASD"
}
]
}
I got the error below when I ran the Dataflow job in GCP:
Error message from worker: java.lang.RuntimeException: java.io.IOException: Insert failed: [{"errors":[{"debugInfo":"","location":"serialnumbers","message":"Array specified for non-repeated field: serialnumbers.","reason":"invalid"}],"index":0}]
How do I create correct sample data based on the above schema?
Read the spec
The value of a union is encoded in JSON as follows:
if its type is null, then it is encoded as a JSON null;
otherwise it is encoded as a JSON object with one name/value pair whose name is the type’s name and whose value is the recursively encoded value
So, here's the data it expects.
{
"employeeID": "qtete46524",
"employeeNumber": {
"string": "custnumber9813"
},
"serialNumbers": {"array": [
"serialNumbers3521"
]},
"correlationId": "corr-656532443",
"timestamp": 1476538955719,
"employmentscreening": "NO",
"vouchercodes": {"array": [
{
"voucherName": {"string": "skygo"},
"authocode": {"string": "A238472ASD"}
}
]}
}
With this schema
{
"namespace": "de.morris.audit",
"type": "record",
"name": "AuditDataChangemorris",
"fields": [
{
"name": "employeeID",
"type": "string"
},
{
"name": "employeeNumber",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "serialNumbers",
"type": [
"null",
{
"type": "array",
"items": "string"
}
]
},
{
"name": "correlationId",
"type": "string"
},
{
"name": "timestamp",
"type": {
"type": "long",
"logicalType": "timestamp-millis"
}
},
{
"name": "employmentscreening",
"type": {
"type": "enum",
"name": "employmentscreening",
"symbols": [
"NO",
"YES"
]
}
},
{
"name": "vouchercodes",
"type": [
"null",
{
"type": "array",
"items": {
"name": "Vouchercodes",
"type": "record",
"fields": [
{
"name": "voucherName",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "authocode",
"type": [
"null",
"string"
],
"default": null
}
]
}
}
],
"default": null
}
]
}
Here's an example of producing to and consuming from Kafka:
$ jq -rc < /tmp/data.json | kafka-avro-console-producer --topic foobar --property value.schema="$(jq -rc < /tmp/data.avsc)" --bootstrap-server localhost:9092 --sync
$ kafka-avro-console-consumer --topic foobar --from-beginning --bootstrap-server localhost:9092 | jq
{
"employeeID": "qtete46524",
"employeeNumber": {
"string": "custnumber9813"
},
"serialNumbers": {
"array": [
"serialNumbers3521"
]
},
"correlationId": "corr-656532443",
"timestamp": 1476538955719,
"employmentscreening": "NO",
"vouchercodes": {
"array": [
{
"voucherName": {
"string": "skygo"
},
"authocode": {
"string": "A238472ASD"
}
}
]
}
}
^CProcessed a total of 1 messages
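If you prefer to check the union wrapping from Python rather than the console producer, fastavro's JSON writer follows the same JSON-encoding rules quoted from the spec above. This is a rough sketch (my own, not part of the original answer) assuming the fastavro library and that the corrected schema is saved as /tmp/data.avsc, as in the CLI example:

# Sketch: produce the Avro JSON encoding (union wrapping included) with fastavro.
import io
import json

from fastavro import json_writer, parse_schema

schema = parse_schema(json.load(open("/tmp/data.avsc")))

record = {
    "employeeID": "qtete46524",
    "employeeNumber": "custnumber9813",   # plain Python values go in...
    "serialNumbers": ["363536623", "5846373733"],
    "correlationId": "corr-656532443",
    "timestamp": 1476538955719,
    "employmentscreening": "NO",
    "vouchercodes": [{"voucherName": "skygo", "authocode": "A238472ASD"}],
}

out = io.StringIO()
json_writer(out, schema, [record])
# ...and come out wrapped, e.g. "employeeNumber": {"string": "custnumber9813"}
print(out.getvalue())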

AvroSerializer: schema for orderbook snapshots

I have a Kafka cluster running, and I want to store L2-orderbook snapshots in a topic. The snapshots contain dictionaries of {key: value} pairs where the keys are of type float, as in the following example:
{
'exchange': 'ex1',
'symbol': 'sym1',
'book': {
'bid': {
100.0: 20.0,
101.0: 21.3,
102.0: 34.6,
...,
},
'ask': {
100.0: 20.0,
101.0: 21.3,
102.0: 34.6,
...,
}
},
'timestamp': 1642524222.1160505
}
My schema proposal below is not working, and I'm pretty sure it is because the keys in the 'bid' and 'ask' dictionaries are not of type string.
{
"namespace": "confluent.io.examples.serialization.avro",
"name": "L2_Book",
"type": "record",
"fields": [
{"name": "exchange", "type": "string"},
{"name": "symbol", "type": "string"},
{"name": "book", "type": "record", "fields": {
"name": "bid", "type": "record", "fields": {
{"name": "price", "type": "float"},
{"name": "volume", "type": "float"}
},
"name": "ask", "type": "record", "fields": {
{"name": "price", "type": "float"},
{"name": "volume", "type": "float"}
}
},
{"name": "timestamp", "type": "float"}
]
}
KafkaError{code=_VALUE_SERIALIZATION,val=-161,str="no value and no default for bids"}
What would be a proper Avro schema here?
First, you have a typo. fields needs to be an array in the schema definition.
However, your bid (and ask) objects are not records. They are a map<float, float>. In other words, they do not have literal price and volume keys.
Avro has Map types, but the keys are "assumed to be strings".
You are welcome to try
{"name": "bid", "type": "map", "values": "float"}
Otherwise, you need to reformat your data payload, for example as a list of objects:
'bid': [
{'price': 100.0, 'volume': 20.0},
...,
],
Along with
{"name": "bid", "type": "array", "items": {
"type": "record",
"name": "BidItem",
"fields": [
{"name": "price", "type": "float"},
{"name": "volume", "type": "float"}
]
}}
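To get the data into that shape, a small Python sketch (my own illustration, assuming plain dicts as in the question) could reshape one side of the book like this:

# Sketch: reshape one side of the book from {price: volume} into a list of records
# matching the array-of-records schema suggested above.
def side_to_records(side: dict) -> list:
    """{100.0: 20.0, 101.0: 21.3} -> [{'price': 100.0, 'volume': 20.0}, ...]"""
    return [{"price": price, "volume": volume} for price, volume in side.items()]

book = {"bid": {100.0: 20.0, 101.0: 21.3}, "ask": {102.0: 34.6}}
book_for_avro = {side: side_to_records(levels) for side, levels in book.items()}
print(book_for_avro["bid"])  # [{'price': 100.0, 'volume': 20.0}, {'price': 101.0, 'volume': 21.3}]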
I have finally figured out two working resolutions. In both cases I need to convert the original data.
The main lessons for me have been:
Avro maps need keys of type string
Avro complex types (e.g. maps and records) need to be defined as nested type objects, e.g.:
{"name": "bid", "type":
    {"type": "array", "items": {
        ...
Special thanks to OneCricketeer for pointing me in the right direction! :-)
1) bids and asks as a map with the key being of type string
data example
{
'exchange': 'ex1',
'symbol': 'sym1',
'book': {
'bid': {
"100.0": 20.0,
"101.0": 21.3,
"102.0": 34.6,
...,
},
'ask': {
"100.0": 20.0,
"101.0": 21.3,
"102.0": 34.6,
...,
}
},
'timestamp': 1642524222.1160505
}
schema
{
"namespace": "confluent.io.examples.serialization.avro",
"name": "L2_Book",
"type": "record",
"fields": [
{"name": "exchange", "type": "string"},
{"name": "symbol", "type": "string"},
{"name": "book", "type": {
"name": "book",
"type": "record",
"fields": [
{"name": "bid", "type": {
"type": "map", "values": "float"
}
},
{"name": "ask", "type": {
"type": "map", "values": "float"
}
}
]}
},
{"name": "timestamp", "type": "float"}
]
}
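The only data conversion this first resolution needs is stringifying the float keys before handing the snapshot to the serializer; a one-line sketch in Python (my own illustration):

# Sketch: stringify the float price keys so each side fits an Avro map with string keys.
bid = {100.0: 20.0, 101.0: 21.3, 102.0: 34.6}
bid_for_avro = {str(price): volume for price, volume in bid.items()}
print(bid_for_avro)  # {'100.0': 20.0, '101.0': 21.3, '102.0': 34.6}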
2) bids and asks as an array of records
data example
{
'exchange': 'ex1',
'symbol': 'sym1',
'book': {
'bid': [
{"price": 100.0, "volume": 20.0},
{"price": 101.0, "volume": 21.3},
{"price": 102.0, "volume": 34.6},
...,
],
'ask': [
{"price": 100.0, "volume": 20.0},
{"price": 101.0, "volume": 21.3},
{"price": 102.0, "volume": 34.6},
...,
]
},
'timestamp': 1642524222.1160505
}
schema
{
"namespace": "confluent.io.examples.serialization.avro",
"name": "L2_Book",
"type": "record",
"fields": [
{"name": "exchange", "type": "string"},
{"name": "symbol", "type": "string"},
{"name": "book", "type": {
"name": "book",
"type": "record",
"fields": [
{"name": "bid", "type": {
"type": "array", "items": {
"name": "bid",
"type": "record",
"fields": [
{"name": "price", "type": "float"},
{"name": "volume", "type": "float"}
]
}
}},
{"name": "ask", "type": {
"type": "array", "items": {
"name": "ask",
"type": "record",
"fields": [
{"name": "price", "type": "float"},
{"name": "volume", "type": "float"}
]
}
}}
]}},
{"name": "timestamp", "type": "float"}
]
}

Avro consumer cannot deserialize schema auto-registered by connector

We are trying to consume a topic that has data emitted by a connector. We are using a handwritten schema that matches the data in the topic.
{
"type": "record",
"name": "Event",
"namespace": "com.example.avro",
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "type",
"type": ["null", "string"],
"default": null
},
{
"name": "entity_id",
"type": ["null", "string"],
"default": null
},
{
"name": "emitted_at",
"type": ["null", "string"],
"default": null
},
{
"name": "data",
"type": ["null", "string"],
"default": null
}
]
}
Unfortunately, the consumer cannot deserialize this because of the schema that was auto-registered by the connector:
{
"type": "record",
"name": "Value",
"namespace": "postgres.public.events",
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "type",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "entity_id",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "emitted_at",
"type": [
"null",
{
"type": "string",
"connect.version": 1,
"connect.name": "io.debezium.time.ZonedTimestamp"
}
],
"default": null
},
{
"name": "data",
"type": [
"null",
{
"type": "string",
"connect.version": 1,
"connect.name": "io.debezium.data.Json"
}
],
"default": null
}
],
"connect.name": "postgres.public.events.Value"
}
We are getting the following error:
Caused by: org.apache.kafka.common.errors.SerializationException: Could not find class postgres.public.events.Value specified in writer's schema whilst finding reader's schema for a SpecificRecord.
How do we resolve this issue?
You can either download the schema from the registry instead of defining your own (there are Maven plugins to do this), or change the namespace + name of your own schema such that the generated class will match.
Adding an alias might work as well, but I've not had much experience/luck with that, personally.
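For the first option, here is a rough sketch of pulling the writer's schema straight from the Schema Registry REST API so you can generate your classes from it (my own illustration; the registry URL and the subject name, which assumes the default TopicNameStrategy of <topic>-value, are assumptions about your setup):

# Sketch: download the schema the connector registered, then generate classes from it
# (e.g. with avro-maven-plugin) instead of hand-writing the schema.
import json

import requests

REGISTRY = "http://localhost:8081"                 # assumed registry URL
SUBJECT = "postgres.public.events-value"           # assumed subject (TopicNameStrategy)

resp = requests.get(f"{REGISTRY}/subjects/{SUBJECT}/versions/latest")
resp.raise_for_status()
writer_schema = json.loads(resp.json()["schema"])  # the registry returns the schema as a string

print(writer_schema["namespace"], writer_schema["name"])  # postgres.public.events Value

# Save it where your build generates classes from; the path is illustrative.
with open("src/main/avro/events-value.avsc", "w") as fp:
    json.dump(writer_schema, fp, indent=2)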

Avro Nested array exception

I am trying to generate an Avro schema for a nested array.
The top-most array, Stores, is the issue; however, the inner array Business is correct.
{"name": "Stores",
"type": {
"type": "array",
"items": {
"name": "Hours",
"type": "record",
"fields": [
{
"name": "Week",
"type": "string"
},
{"name": "Business",
"type":"array",
"items": {"name":"Business_record","type":"record","fields":[
{"name": "Day", "type":"string"},
{"name": "StartTime", "type": "string"},
{"name": "EndTime", "type": "string"}
]}
}
]
}
}
And the exception I'm getting is:
[ {
"level" : "fatal",
"message" : "illegal Avro schema",
"exceptionClass" : "org.apache.avro.SchemaParseException",
"exceptionMessage" : "No type: {\"name\":\"Stores\",\"type\":{\"type\":\"array\",\"items\":{\"name\":\"Hours\",\"type\":\"record\",\"fields\":[{\"name\":\"Week\",\"type\":\"string\"},{\"name\":\"Business\",\"type\":\"array\",\"items\":{\"name\":\"Business_record\",\"type\":\"record\",\"fields\":[{\"name\":\"Day\",\"type\":\"string\"},{\"name\":\"StartTime\",\"type\":\"string\"},{\"name\":\"EndTime\",\"type\":\"string\"}]}}]}}}",
"info" : "other messages follow (if any)"
} ]
I think it has something to do with [] or {} for the outer array field, but I'm not able to figure it out.
Any help is appreciated.
I found the mistake I was making:
When I wrapped the nested array in a "type" object, it worked for me.
{
"name": "Stores",
"type": "array",
"items": {
"name": "Hours",
"type": "record",
"fields": [
{
"name": "Week",
"type": "string"
},
{
"name": "Business",
"type": {
"type": "array",
"items": {
"name": "Business_record",
"type": "record",
"fields": [
{
"name": "Day",
"type": "string"
},
{
"name": "StartTime",
"type": "string"
},
{
"name": "EndTime",
"type": "string"
}
]
}
}
}
]
}
}
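A quick way to confirm a schema like this is legal Avro is to run it through a parser, which raises on an illegal schema. A sketch (my own, assuming the fastavro Python library and that the corrected schema above is saved as stores.avsc, a hypothetical filename):

# Sketch: sanity-check the corrected schema by parsing it and writing one sample datum.
import io
import json

import fastavro

parsed = fastavro.parse_schema(json.load(open("stores.avsc")))  # raises if the schema is illegal

# One Stores datum: an array of Hours records, each with a nested Business array.
sample = [{
    "Week": "Week1",
    "Business": [
        {"Day": "Monday", "StartTime": "09:00", "EndTime": "17:00"},
    ],
}]

buf = io.BytesIO()
fastavro.writer(buf, parsed, [sample])  # serializes cleanly if the data matches the schema
print(len(buf.getvalue()), "bytes written")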

Avro schema issue when record missing a field

I am using the NiFi (v1.2) processor ConvertJSONToAvro. I am not able to parse a record that only contains one of the two elements in a "record" type. The element is also allowed to be missing entirely from the data. Is my Avro schema incorrect?
Schema snippet:
"name": "personname",
"type": [
"null":,
{
"type": "record",
"name": "firstandorlast",
"fields": [
{
"name": "first",
"type": [
"null",
"string"
]
},
{
"name": "last",
"type": [
"null",
"string"
]
}
]
}
]
If "personname" contains both "first" and "last" it works, but if it only contains one of the elements, it fails with the error: Cannot convert field personname: cannot resolve union:
{ "last":"Smith" }
not in
"type": [ "null":,
{
"type": "record",
"name": "firstandorlast",
"fields": [
{
"name": "first",
"type": [
"null",
"string"
]
},
{
"name": "last",
"type": [
"null",
"string"
]
}
]
}
]
You are missing the default values; for a ["null", "string"] union the default must be a JSON null.
https://avro.apache.org/docs/1.8.1/spec.html#schema_record
Your schema should look like this:
"name": "personname",
"type": [
"null":,
{
"type": "record",
"name": "firstandorlast",
"fields": [
{
"name": "first",
"type": [
"null",
"string"
],
"default": "null"
},
{
"name": "last",
"type": [
"null",
"string"
],
"default": "null"
}
]
}
]
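To see the effect of the defaults, here is a sketch checked with the fastavro Python library rather than NiFi (my own illustration; the field snippet above is wrapped in a hypothetical outer record named person so it forms a complete schema). With null defaults on the inner fields, a record that carries only "last" now resolves the union:

# Sketch: a record with only "last" serializes once the inner fields have null defaults.
import io

import fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "person",              # hypothetical wrapper record
    "fields": [
        {
            "name": "personname",
            "type": [
                "null",
                {
                    "type": "record",
                    "name": "firstandorlast",
                    "fields": [
                        {"name": "first", "type": ["null", "string"], "default": None},
                        {"name": "last", "type": ["null", "string"], "default": None},
                    ],
                },
            ],
            "default": None,
        }
    ],
})

buf = io.BytesIO()
# Only "last" is present; "first" falls back to its null default.
fastavro.writer(buf, schema, [{"personname": {"last": "Smith"}}])
buf.seek(0)
print(list(fastavro.reader(buf)))  # [{'personname': {'first': None, 'last': 'Smith'}}]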
