In Avro why must we specify a "null" string for correct Enum types? - avro

I am completely new to Avro serialization and I am trying to get my head around how complex types are defined.
I am puzzled by how Avro generates the Enums in Java.
{
  "type": "record",
  "namespace": "com.example",
  "name": "Customer",
  "doc": "Avro Schema for our Customer",
  "fields": [
    {"name": "first_name", "type": "string", "doc": "First Name of Customer"},
    {"name": "last_name", "type": "string", "doc": "Last Name of Customer"},
    {"name": "automated_email", "type": "boolean", "doc": "true if the user wants marketing email", "default": true},
    {
      "name": "customer_type",
      "type": ["null",
        {
          "name": "Customertype",
          "type": "enum",
          "symbols": ["OLD", "NEW"]
        }
      ]
    }
  ]
}
Notice the customer_type field. If I include "null", then in my generated sources I get the correct Enum type, which is:
private com.example.Customertype customer_type;
But the moment I remove the null value and define customer_type in the following way:
{
  "name": "customer_type",
  "type": [
    {
      "name": "Customertype",
      "type": "enum",
      "symbols": ["OLD", "NEW"]
    }
  ]
}
The declaration changes to:
private Object customer_type;
What does that "null" string signify? Why is it important?
I have tried looking through the Avro specification, but nothing has given me a clear-cut answer as to why it works this way.
I am using the Avro Maven plugin.
Any beginner resources for Avro would also be appreciated.
Thank you.
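For reference, a minimal avro-maven-plugin setup looks roughly like this (the version number and directory paths here are assumptions, not taken from the question):

```xml
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.11.3</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```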

If you are going to remove the null, you should also remove the [ and ] brackets, because the type is then no longer a union. The reason you saw Object is that the Avro code generator only produces a concrete Java field type for a union when it is the ["null", T] pattern (an optional field); for any other union, including your one-element union, it falls back to Object.
So your customer_type schema should look like this:
{
  "name": "customer_type",
  "type": {
    "name": "Customertype",
    "type": "enum",
    "symbols": ["OLD", "NEW"]
  }
}
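Conversely, if you want to keep the field optional, the usual pattern is to keep the ["null", ...] union and add a matching default. A sketch reusing the names from the question (note that with "default": null, the "null" branch must come first in the union):

```json
{
  "name": "customer_type",
  "type": ["null",
    {
      "name": "Customertype",
      "type": "enum",
      "symbols": ["OLD", "NEW"]
    }
  ],
  "default": null
}
```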

Related

Avro Records with Unknown Key Names and Quantity but Known Values

I am looking to convert JSON to Avro without altering the shape of the data.
A nested field in the JSON contains a variable number of keys, which are never known in advance. The record that branches off each of these unknown nodes, however, has a known, well-defined shape.
An example of this input data is shown below:
{
  "customers": {
    "paul_ince": {
      "name": "Mr Paul Ince",
      "age": 54
    },
    "kim_kardashian": {
      "name": "Ms Kim Kardashian",
      "age": 41
    },
    "elon_musk": {
      "name": "Elon Musk, Technoking of Tesla",
      "age": 50
    }
  }
}
Now it would of course be more Avro-friendly to have customers lead to an array of the form:
{
  "customers": [
    {
      "customer_name": "paul_ince",
      "name": "Mr Paul Ince",
      "age": 54
    },
    {
      ...
    }
  ]
}
But this would violate my constraint that the shape of the input data be unchanged.
This problem seems to manifest itself frequently if I rely on data from external sources or scrapes, or preexisting data that was never created for Avro.
In my head, the schema would look something like this:
{
  "fields": [
    {
      "name": "customers",
      "type": {
        "type": "record",
        "name": "customers",
        "fields": [
          {
            "name": $customer_name,
            "type": {
              "type": "record",
              "name": $customer_name,
              "fields": [
                {
                  "name": "name",
                  "type": "string"
                },
                {
                  "name": "age",
                  "type": "int"
                }
              ]
            }
          }
        ]
      }
    }
  ]
}
where $customer_name is an assignment value, defined on read. Just asking the question, it feels like this violates fundamental Avro, but I must use Avro, and I strongly want to maintain the input shape of the data. It would be highly impractical to modify it, not least given how frequently this problem appears and how large and varied the data I need to transfer from JSON to Avro is.
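For what it's worth, the usual way to model an object with arbitrary, unknown keys in Avro is a map type whose values are a well-defined record. Because Avro maps serialize to/from JSON objects, this keeps the input shape intact; a sketch (the record name Customer is my own choice, and map keys in Avro are always strings and are not validated against a schema):

```json
{
  "name": "customers",
  "type": {
    "type": "map",
    "values": {
      "type": "record",
      "name": "Customer",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"}
      ]
    }
  }
}
```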

Avro schema - map type as optional field

How do I make the array-of-map field in an Avro schema optional? The below schema is working; however, if this field is missing in the data, parsing fails with org.apache.avro.AvroTypeException: Error converting field quantities, caused by org.apache.avro.AvroTypeException: Expected array-start. Got VALUE_NULL.
I just want to make sure the deserialization of data goes through whether the field is present in the data or not.
{
  "name": "quantities",
  "type": {
    "items": {
      "type": "map",
      "values": "string"
    },
    "type": "array"
  },
  "default": null
}
Just found a solution myself. This will make the array-of-map field optional in the Avro schema:
{
  "name": "quantities",
  "type": ["null",
    {
      "type": "array",
      "items": {
        "type": "map",
        "values": "string"
      }
    }
  ],
  "default": null
}

Writing an array of multiple different Records to Avro format, into the same file

We have some legacy file format, which I would need to migrate to Avro storage. The tricky part is that the records basically have
some common fields,
a discriminator field and
some unique fields, specific to the type selected by the discriminator field
all of them stored in the same file, without any order, fully mixed with each other. (It's legacy...)
In Java/object-oriented programming, one could represent our records concept as the following:
abstract class RecordWithCommonFields {
    private Long commonField1;
    private String commonField2;
    ...
}

class RecordTypeA extends RecordWithCommonFields {
    private Integer specificToA1;
    private String specificToA2;
    ...
}

class RecordTypeB extends RecordWithCommonFields {
    private Boolean specificToB1;
    private String specificToB2;
    ...
}
Imagine the data being something like this:
commonField1Value;commonField2Value,TYPE_IS_A,specificToA1Value,specificToA2Value
commonField1Value;commonField2Value,TYPE_IS_B,specificToB1Value,specificToB2Value
So I would like to process an incoming file and write its content to Avro format, somehow representing the different types of the records.
Can someone give me some ideas on how to achieve this?
Nandor from the Avro users mailing list was kind enough to help me out with this answer; credits go to him. This answer is here for the record, in case someone else hits the same issue.
His solution is simple: use composition rather than inheritance, introducing a common container class with a field that references the specific subclass.
With this approach the mapping looks like this:
{
  "namespace": "com.foobar",
  "name": "UnionRecords",
  "type": "array",
  "items": {
    "type": "record",
    "name": "RecordWithCommonFields",
    "fields": [
      {"name": "commonField1", "type": "string"},
      {"name": "commonField2", "type": "string"},
      {"name": "subtype", "type": [
        {
          "type": "record",
          "name": "RecordTypeA",
          "fields": [
            {"name": "integerSpecificToA1", "type": ["null", "long"]},
            {"name": "stringSpecificToA1", "type": ["null", "string"]}
          ]
        },
        {
          "type": "record",
          "name": "RecordTypeB",
          "fields": [
            {"name": "booleanSpecificToB1", "type": ["null", "boolean"]},
            {"name": "stringSpecificToB1", "type": ["null", "string"]}
          ]
        }
      ]}
    ]
  }
}
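For what it's worth, in Avro's JSON encoding each union value is wrapped in an object keyed by the branch's (full) type name, so the subtype field itself acts as the discriminator. A serialized element of the array above would look roughly like this (the field values are made up):

```json
{
  "commonField1": "commonField1Value",
  "commonField2": "commonField2Value",
  "subtype": {
    "com.foobar.RecordTypeA": {
      "integerSpecificToA1": {"long": 42},
      "stringSpecificToA1": {"string": "specificToA1Value"}
    }
  }
}
```

Note that the nullable ["null", ...] fields are wrapped the same way when non-null.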

default values in avro schema not written to Java class

I have a simple schema as follows:
{
  "name": "owner",
  "type": "record",
  "doc": "todo",
  "fields": [
    {"name": "version", "type": "int", "default": 1},
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
However, when I use the avro-maven-plugin to generate a Java class from this specification, it does not set the default value of the version field to 1.
How do I make that happen?
Never mind, it works fine as is.
I was looking at the generated Java class and could not figure out where it was setting the default value to 1. But when I serialize the object to JSON, I see the default value in the output, and the getter returns the default value as well.

Does Avro support required fields?

i.e., is it possible to make a field required, similar to ProtoBuf:
message SearchRequest {
    required string query = 1;
}
All fields are required in Avro by default. As mentioned in the official documentation, if you want to make something optional, you have to make it nullable by unioning its type with null, like this:
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
In this example, name is required, while favorite_number and favorite_color are optional. I recommend spending some more time with the documentation.
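A related gotcha if you also want a default value for an optional field: the default of a union-typed field must match the first branch of the union, so a default of null requires listing "null" first. A sketch:

```json
{"name": "favorite_number", "type": ["null", "int"], "default": null}
```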
