Recursive Avro schema type in Schema Registry

I want to create an Avro schema for Schema Registry for the following TypeScript code:
export type Value = {
  [key: string]:
    | Value
    | Value[]
    | string
    | number;
};
It's a recursive map type. I know it is possible to create a recursive record like below, but it's a different use case.
export type Node = {
  value: number;
  leafs: Node[];
};
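For reference, the Avro counterpart of this recursive record refers to the record's own name once it is defined; a sketch, mapping TypeScript's number to double as an assumption:
{
  "type": "record",
  "name": "Node",
  "fields": [
    { "name": "value", "type": "double" },
    { "name": "leafs", "type": { "type": "array", "items": "Node" } }
  ]
}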
I tried different approaches, including named types and schema references, but all resulted in validation errors when publishing a schema.
A simplified version of the desired (but invalid) schema, excluding the recursive array, looks like this:
{
  "type": "record",
  "name": "Value",
  "namespace": "com.namespace",
  "fields": [
    { "name": "itemValues", "type": { "type": "map", "values": ["string", "int", "itemValues"] } }
  ]
}
Most variations of this schema result in an error: org.apache.avro.SchemaParseException: Undefined name: "itemValues"
I could not find examples of similar scenarios, and I wonder whether it is even possible to create a schema like this. The most likely limitation is the lack of named union and map types in Avro.
Update
An example of the JSON I want to achieve:
{
  "itemValues": {
    "validA": "sth",
    "validB": [],
    "validC": 8,
    "recursiveProperty": {
      "anyMap": { "sth": "else" }
    }
  }
}

The problem with your schema is that your values list is ["string", "int", "itemValues"]: the parser complains because you have told it there should be some type itemValues, but you haven't defined one. The only type you have defined is Value.
Here's the fixed schema (including the array of Value as one of the potential types):
{
  "type": "record",
  "name": "Value",
  "namespace": "com.namespace",
  "fields": [
    { "name": "itemValues", "type": { "type": "map", "values": ["string", "int", "Value", {"type": "array", "items": "Value"}] } }
  ]
}
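One caveat when testing this with Avro's JSON encoding: union values are wrapped in a single-key object naming the branch, and every nested Value is a record, so it carries its own itemValues field. The example from the Update would look roughly like this in Avro JSON form (the fully qualified branch name com.namespace.Value is an assumption about how the implementation renders named-type branches):
{
  "itemValues": {
    "validA": { "string": "sth" },
    "validB": { "array": [] },
    "validC": { "int": 8 },
    "recursiveProperty": {
      "com.namespace.Value": {
        "itemValues": {
          "anyMap": {
            "com.namespace.Value": {
              "itemValues": {
                "sth": { "string": "else" }
              }
            }
          }
        }
      }
    }
  }
}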

Related

Avro Records with Unknown Key Names and Quantity but Known Values

I am looking to convert JSON to Avro without altering the shape of the data.
A nested field in the JSON contains a variable number of keys, which are never known in advance. The record that branches off each of these unknown nodes, however, has a known, well-defined shape.
An example of this input data is shown below:
{
  "customers": {
    "paul_ince": {
      "name": "Mr Paul Ince",
      "age": 54
    },
    "kim_kardashian": {
      "name": "Ms Kim Kardashian",
      "age": 41
    },
    "elon_musk": {
      "name": "Elon Musk, Technoking of Tesla",
      "age": 50
    }
  }
}
Now, it would of course be more Avro-friendly to have customers lead to an array of the form
{
  "customers": [
    {
      "customer_name": "paul_ince",
      "name": "Mr Paul Ince",
      "age": 54
    },
    {
      ...
    }
  ]
}
But this would violate my constraint that the shape of the input data be unchanged.
This problem seems to manifest itself frequently if I rely on data from external sources or scrapes, or preexisting data that was never created for Avro.
In my head, the schema would look something like this:
{
  "fields": [
    {
      "name": "customers",
      "type": {
        "type": "record",
        "name": "customers",
        "fields": [
          {
            "name": $customer_name,
            "type": {
              "type": "record",
              "name": $customer_name,
              "fields": [
                {
                  "name": "name",
                  "type": "string"
                },
                {
                  "name": "age",
                  "type": "int"
                }
              ]
            }
          }
        ]
      }
    }
  ]
}
where $customer_name is an assignment value, defined on read. Just asking the question, it feels like this violates fundamental Avro principles, but I must use Avro, and I strongly desire to maintain the input shape of the data. It would be highly impractical to modify it, not least given how frequently this problem appears and how large and varied the data I need to transfer from JSON to Avro is.
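For what it's worth, the usual way to keep this shape in Avro is a map, since map keys are not fixed by the schema. A minimal sketch (the record and field names here are illustrative):
{
  "type": "record",
  "name": "CustomerFile",
  "fields": [
    {
      "name": "customers",
      "type": {
        "type": "map",
        "values": {
          "type": "record",
          "name": "Customer",
          "fields": [
            { "name": "name", "type": "string" },
            { "name": "age", "type": "int" }
          ]
        }
      }
    }
  ]
}
The unknown keys (paul_ince, kim_kardashian, ...) then become map keys rather than field names, so the input shape is preserved exactly.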

In Avro why must we specify a "null" string for correct Enum types?

I am completely new to Avro serialization and I am trying to get my head around how complex types are defined.
I am puzzled by how Avro generates the Enums in Java.
{
  "type": "record",
  "namespace": "com.example",
  "name": "Customer",
  "doc": "Avro Schema for our Customer",
  "fields": [
    { "name": "first_name", "type": "string", "doc": "First Name of Customer" },
    { "name": "last_name", "type": "string", "doc": "Last Name of Customer" },
    { "name": "automated_email", "type": "boolean", "doc": "true if the user wants marketing email", "default": true },
    {
      "name": "customer_type",
      "type": ["null",
        {
          "name": "Customertype",
          "type": "enum",
          "symbols": ["OLD", "NEW"]
        }
      ]
    }
  ]
}
Notice the customer_type field. If I include "null", then in my generated sources I get the correct enum type, which is:
private com.example.Customertype customer_type;
But the moment I remove the null value and define customer_type in the following way:
{
  "name": "customer_type",
  "type": [
    {
      "name": "Customertype",
      "type": "enum",
      "symbols": ["OLD", "NEW"]
    }
  ]
}
The declaration changes to:
private Object customer_type;
What does that "null" string signify? Why is it important?
I have tried looking through the Avro specification, but nothing has given me a clear-cut answer as to why it works the way it does.
I am using the Avro Maven plugin.
Any beginner resources for Avro will also be appreciated.
Thank you.
If you are going to remove the null, you should remove the [ and ] brackets (because it is no longer a union).
So your customer_type schema should look like this:
{
  "name": "customer_type",
  "type": {
    "name": "Customertype",
    "type": "enum",
    "symbols": ["OLD", "NEW"]
  }
}
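For completeness: "null" here is Avro's null type, and ["null", Customertype] is a union that marks the field as optional, which is why codegen can give it the concrete enum type; the single-branch union [Customertype] is what made the generated field fall back to Object. The common optional-enum pattern also adds "default": null so writers can omit the field (adding the default is an assumption, not part of the original question):
{
  "name": "customer_type",
  "type": ["null",
    {
      "name": "Customertype",
      "type": "enum",
      "symbols": ["OLD", "NEW"]
    }
  ],
  "default": null
}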

Avro schema - map type as optional field

How do I make the array of maps in an Avro schema an optional field? The schema below is working; however, if this field is missing in the data, parsing fails with org.apache.avro.AvroTypeException: Error converting field - quantities, caused by org.apache.avro.AvroTypeException: Expected array-start. Got VALUE_NULL.
I just want to make sure the deserialization of the data goes through whether the field is present in the data or not.
{
  "name": "quantities",
  "type": {
    "type": "array",
    "items": {
      "type": "map",
      "values": "string"
    }
  },
  "default": null
}
Just found a solution myself. This will make the array-of-maps field optional in the Avro schema:
{
  "name": "quantities",
  "type": ["null",
    {
      "type": "array",
      "items": {
        "type": "map",
        "values": "string"
      }
    }
  ],
  "default": null
}
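A side note, assuming the data itself is written with Avro's JSON encoding: non-null union values are wrapped in an object that names the branch, so the two cases would look like this (the size entry is an illustrative map value):
{ "quantities": null }
{ "quantities": { "array": [ { "size": "small" } ] } }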

Writing an array of multiple different Records to Avro format, into the same file

We have a legacy file format that I need to migrate to Avro storage. The tricky part is that the records basically have
some common fields,
a discriminator field and
some unique fields, specific to the type selected by the discriminator field
all of them stored in the same file, without any order, fully mixed with each other. (It's legacy...)
In Java/object-oriented programming, one could represent our records concept as the following:
abstract class RecordWithCommonFields {
    private Long commonField1;
    private String commonField2;
    ...
}
class RecordTypeA extends RecordWithCommonFields {
    private Integer integerSpecificToA1;
    private String stringSpecificToA1;
    ...
}
class RecordTypeB extends RecordWithCommonFields {
    private Boolean booleanSpecificToB1;
    private String stringSpecificToB1;
    ...
}
Imagine the data being something like this:
commonField1Value;commonField2Value,TYPE_IS_A,integerSpecificToA1Value,stringSpecificToA1Value
commonField1Value;commonField2Value,TYPE_IS_B,booleanSpecificToB1Value,stringSpecificToB1Value
So I would like to process an incoming file and write its content to Avro format, somehow representing the different types of the records.
Can someone give me some ideas on how to achieve this?
Nandor from the Avro users mailing list was kind enough to help me out with this answer; credits go to him. This answer is here for the record, in case someone else hits the same issue.
His solution is simple, basically using composition rather than inheritance, by introducing a common container class and a field referencing a specific subclass.
With this approach the mapping looks like this:
{
  "namespace": "com.foobar",
  "name": "UnionRecords",
  "type": "array",
  "items": {
    "type": "record",
    "name": "RecordWithCommonFields",
    "fields": [
      { "name": "commonField1", "type": "string" },
      { "name": "commonField2", "type": "string" },
      { "name": "subtype", "type": [
        {
          "type": "record",
          "name": "RecordTypeA",
          "fields": [
            { "name": "integerSpecificToA1", "type": ["null", "long"] },
            { "name": "stringSpecificToA1", "type": ["null", "string"] }
          ]
        },
        {
          "type": "record",
          "name": "RecordTypeB",
          "fields": [
            { "name": "booleanSpecificToB1", "type": ["null", "boolean"] },
            { "name": "stringSpecificToB1", "type": ["null", "string"] }
          ]
        }
      ]}
    ]
  }
}
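With this schema, the discriminator becomes implicit in the union branch chosen for subtype. In Avro's JSON encoding, a single RecordTypeA element would look roughly like this (the fully qualified branch names and the sample values are assumptions for illustration):
{
  "commonField1": "commonField1Value",
  "commonField2": "commonField2Value",
  "subtype": {
    "com.foobar.RecordTypeA": {
      "integerSpecificToA1": { "long": 1 },
      "stringSpecificToA1": { "string": "specificToA1Value" }
    }
  }
}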

Can I define nested array objects in Swagger 2.0?

We are using Swagger 2.0 for our documentation. We are programmatically creating the Swagger 2.0 spec straight out of our data design documents.
Our model is very complex and nested. I would like to understand whether we can define nested array objects inline.
For example:
{
  "definitions": {
    "user": {
      "type": "object",
      "required": ["name"],
      "properties": {
        "name": {
          "type": "string"
        },
        "address": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "type": {
                "type": "string",
                "enum": ["home", "office"]
              },
              "line1": {
                "type": "string"
              }
            },
            "Person": {
              "type": "object",
              "properties": {
                "name": {
                  "type": "string"
                }
              }
            }
          }
        }
      }
    }
  }
}
We have many cases where we encounter this in our model, and defining a $ref is not an option we want to consider at this time. We need this handled inline.
As per the following post: https://github.com/swagger-api/swagger-editor/issues/603#event-391465196 it looks like nested array objects defined inline are not supported.
Since a lot of big enterprises have very complex data models, we would like this feature to be supported in the Swagger 2.0 spec. Is there any thought on adding this feature?
Your document is simply invalid, and this is not about nested arrays: the property Person is not allowed inside items in a Swagger 2.0 schema.
The only allowed properties in a schema are: $ref, format, title, description, default, multipleOf, maximum, exclusiveMaximum, minimum, exclusiveMinimum, maxLength, minLength, pattern, maxItems, minItems, uniqueItems, maxProperties, minProperties, required, enum, additionalProperties, type, items, allOf, properties, discriminator, readOnly, xml, externalDocs, example.
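If Person was meant to be another property of the address items (an assumption; it may belong elsewhere in your model), moving it inside properties makes the document valid while keeping everything inline:
"items": {
  "type": "object",
  "properties": {
    "type": { "type": "string", "enum": ["home", "office"] },
    "line1": { "type": "string" },
    "Person": {
      "type": "object",
      "properties": {
        "name": { "type": "string" }
      }
    }
  }
}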
