Avro schema evolution

I have two questions:
1. Is it possible to use the same reader to parse records that were written with two compatible schemas, e.g. Schema V2 only has an additional optional field compared to Schema V1, and I want the reader to understand both? I think the answer here is no, but if yes, how do I do that?
2. I have tried writing a record with Schema V1 and reading it with Schema V2, but I get the following error:
org.apache.avro.AvroTypeException: Found foo, expecting foo
I used avro-1.7.3 and:
writer = new GenericDatumWriter<GenericData.Record>(SchemaV1);
reader = new GenericDatumReader<GenericData.Record>(SchemaV2, SchemaV1);
Here are examples of the two schemas (I have tried adding a namespace as well, but no luck).
Schema V1:
{
"name": "foo",
"type": "record",
"fields": [{
"name": "products",
"type": {
"type": "array",
"items": {
"name": "product",
"type": "record",
"fields": [{
"name": "a1",
"type": "string"
}, {
"name": "a2",
"type": {"type": "fixed", "name": "a3", "size": 1}
}, {
"name": "a4",
"type": "int"
}, {
"name": "a5",
"type": "int"
}]
}
}
}]
}
Schema V2:
{
"name": "foo",
"type": "record",
"fields": [{
"name": "products",
"type": {
"type": "array",
"items": {
"name": "product",
"type": "record",
"fields": [{
"name": "a1",
"type": "string"
}, {
"name": "a2",
"type": {"type": "fixed", "name": "a3", "size": 1}
}, {
"name": "a4",
"type": "int"
}, {
"name": "a5",
"type": "int"
}]
}
}
},
{
"name": "purchases",
"type": ["null",{
"type": "array",
"items": {
"name": "purchase",
"type": "record",
"fields": [{
"name": "a1",
"type": "int"
}, {
"name": "a2",
"type": "int"
}]
}
}]
}]
}
Thanks in advance.

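I encountered the same issue. It might look like a bug in Avro, and the error message is certainly misleading, but Avro's schema resolution needs a default for any reader field that the writer's schema lacks. You can probably work around it by adding "default": null to the "purchases" field.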
Check my blog for details: http://ben-tech.blogspot.com/2013/05/avro-schema-evolution.html
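Applied to Schema V2 from the question, the workaround mentioned above would look roughly like the fragment below. This is a sketch of the "purchases" field only; the rest of the schema stays unchanged. The null default is valid here because "null" is the first branch of the union.

```json
{
  "name": "purchases",
  "type": ["null", {
    "type": "array",
    "items": {
      "name": "purchase",
      "type": "record",
      "fields": [
        {"name": "a1", "type": "int"},
        {"name": "a2", "type": "int"}
      ]
    }
  }],
  "default": null
}
```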

You can do the opposite: write the data with Schema V2 and read it with Schema V1. At write time every field is written to the file, and if the reader's schema leaves one of them out, that is fine. But if the data was written with fewer fields than the reader expects, the reader cannot find the extra field at read time and has no default to fall back on, so it raises an error.

The best approach is to maintain a schema mapping with something like the Confluent Schema Registry.
Key takeaways:
1. Unlike Thrift, Avro serialized objects do not hold any schema.
2. As there is no schema stored in the serialized byte array, one has to provide the schema with which it was written.
3. Confluent Schema Registry provides a service to maintain schema versions.
4. Confluent provides a cached schema client, which checks its cache first before sending a request over the network.
5. The JSON schema in an "avsc" file is different from the schema held in an Avro object.
6. All Avro objects extend GenericRecord.
7. During serialization: based on the schema of the Avro object, a schema ID is requested from the Confluent Schema Registry.
8. The schema ID, which is an integer, is converted to 4 bytes and prepended to the serialized Avro object (Confluent's wire format also puts a single magic byte in front of the ID).
9. During deserialization: the header is removed from the byte array, and its 4 ID bytes are converted back to an integer (the schema ID).
10. The schema is requested from the Confluent Schema Registry, and the byte array is deserialized using this schema.
http://bytepadding.com/big-data/spark/avro/avro-serialization-de-serialization-using-confluent-schema-registry/
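The framing in steps 8 and 9 can be sketched with plain JDK classes. This is a minimal illustration, not Confluent's serializer code: the class name SchemaIdFraming and its methods are hypothetical, and the layout (one zero magic byte, then a 4-byte big-endian schema ID, then the Avro payload) is my understanding of the Confluent wire format. The registry lookup itself is left out.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper illustrating the header framing; not a Confluent API.
final class SchemaIdFraming {
    // Assumption: a frame starts with a zero magic byte followed by a 4-byte schema ID.
    static final byte MAGIC_BYTE = 0x0;
    static final int HEADER_SIZE = 1 + Integer.BYTES;

    private SchemaIdFraming() {}

    /** Prepends the magic byte and schema ID to an already Avro-encoded payload. */
    static byte[] frame(int schemaId, byte[] avroPayload) {
        ByteBuffer buffer = ByteBuffer.allocate(HEADER_SIZE + avroPayload.length);
        buffer.put(MAGIC_BYTE);
        buffer.putInt(schemaId); // ByteBuffer writes big-endian by default
        buffer.put(avroPayload);
        return buffer.array();
    }

    /** Validates the header and reads the schema ID from a framed message. */
    static int schemaId(byte[] framed) {
        ByteBuffer buffer = ByteBuffer.wrap(framed);
        if (framed.length < HEADER_SIZE || buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Not a framed message");
        }
        return buffer.getInt();
    }

    /** Returns the Avro payload that follows the header. */
    static byte[] payload(byte[] framed) {
        schemaId(framed); // reuse the header validation
        return Arrays.copyOfRange(framed, HEADER_SIZE, framed.length);
    }
}
```

A consumer would call schemaId(...) first, fetch the matching writer schema from the registry (or a local cache), and then decode payload(...) with that schema.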

Related

Avro Records with Unknown Key Names and Quantity but Known Values

I am looking to convert JSON to Avro without altering the shape of the data.
A nested field in the JSON contains a variable number of keys, which are never known in advance. The record that branches off each of these unknown nodes, however, has a known, well-defined shape.
An example of this input data is shown below:
{
"customers": {
"paul_ince": {
"name": "Mr Paul Ince",
"age": 54
},
"kim_kardashian": {
"name": "Ms Kim Kardashian",
"age": 41
},
"elon_musk": {
"name": "Elon Musk, Technoking of Tesla",
"age": 50
}
}
}
Now it would be more avro friendly of course to have customers lead to an array of the form
{
"customers": [
{
"customer_name": "paul_ince",
"name": "Mr Paul Ince",
"age": 54
},
{
...
}
]
}
But this would violate my constraint that the shape of the input data be unchanged.
This problem seems to manifest itself frequently if I rely on data from external sources or scrapes, or preexisting data that was never created for Avro.
In my head, the schema would look something like the below,
{
"fields": [
{
"name": "customers",
"type": {
"type": "record",
"name": "customers",
"fields": [
{
"name": $customer_name,
"type": {
"type": "record",
"name": $customer_name,
"fields": [
{
"name": "name",
"type": "string",
},
{
"name": "age",
"type": "int"
}
]
}
}
]
}
}
]
}
where $customer_name is a value assigned on read. Just asking the question, it feels like this violates something fundamental in Avro, but I must use Avro and I strongly want to preserve the input shape of the data. Modifying it would be highly impractical, not least given how often this problem appears and how large and varied the JSON data I need to convert to Avro is.
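One possible direction, offered here as a sketch and not taken from the original thread: Avro's map type holds an arbitrary set of string keys whose values all share one schema, which matches "unknown keys, known record shape". The record names CustomerFile and Customer below are made up for illustration.

```json
{
  "type": "record",
  "name": "CustomerFile",
  "fields": [
    {
      "name": "customers",
      "type": {
        "type": "map",
        "values": {
          "type": "record",
          "name": "Customer",
          "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"}
          ]
        }
      }
    }
  ]
}
```

With this shape, keys such as "paul_ince" stay as map keys in the data and never have to appear in the schema.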

How to Create List of records in Avro Schema

I have an Avro schema, shown below.
{"type":"record",
"namespace": "com.test",
"name": "bck",
"fields": [ {"name": "accountid","type": "string"},
{"name":"amendmentpositionid","type": "int"},
{"name":"booking","type":
{"type":"array","items":
{"namespace":"com.test",
"name":"bkkk",
"type":"record",
"fields":
[{"name":"accountid","type":"string"},{"name":"clientid","type":"int"},
{"name":"clientname","type":"string"},{"name":"exerciseid","type":"int"},
{"name":"hedgeid","type":"int"},{"name":"originatingpositionid","type":"int"},
{"name":"positionid","type":"int"},{"name":"relatedpositionid","type":"int"} ]}}}]}
I want to create one more record of the same type as above. In other words, I want to create a list of records where each record has the schema above. How can I achieve this in a single Avro schema file?
The schema you provided already includes an array of records. If I understand correctly, you want another array of records built from this schema, which makes it an array of records within an array of records, all in one schema file.
I hope this helps.
{
"type": "record",
"namespace": "com.test",
"name": "list",
"fields": [{
"name":"listOfBck","type":
{"type":"array","items":
{
"type": "record",
"namespace": "com.test",
"name": "bck",
"fields": [
{"name": "accountid","type": "string"},
{"name":"amendmentpositionid","type": "int"},
{"name":"booking","type":
{"type":"array","items":
{"namespace":"com.test",
"name":"bkkk",
"type":"record",
"fields": [
{"name":"accountid","type":"string"},{"name":"clientid","type":"int"},
{"name":"clientname","type":"string"},{"name":"exerciseid","type":"int"},
{"name":"hedgeid","type":"int"},{"name":"originatingpositionid","type":"int"},
{"name":"positionid","type":"int"},{"name":"relatedpositionid","type":"int"}
]
}
}
}
]
}
}
}]
}

Writing an array of multiple different Records to Avro format, into the same file

We have some legacy file format, which I would need to migrate to Avro storage. The tricky part is that the records basically have
some common fields,
a discriminator field and
some unique fields, specific to the type selected by the discriminator field
all of them stored in the same file, without any order, fully mixed with each other. (It's legacy...)
In Java/object-oriented programming, one could represent our records concept as the following:
abstract class RecordWithCommonFields {
private Long commonField1;
private String commonField2;
...
}
class RecordTypeA extends RecordWithCommonFields {
private Integer integerSpecificToA1;
private String stringSpecificToA1;
...
}
class RecordTypeB extends RecordWithCommonFields {
private Boolean booleanSpecificToB1;
private String stringSpecificToB1;
...
}
Imagine the data being something like this:
commonField1Value;commonField2Value,TYPE_IS_A,specificToA1Value,specificToA1Value
commonField1Value;commonField2Value,TYPE_IS_B,specificToB1Value,specificToB1Value
So I would like to process an incoming file and write its content to Avro format, somehow representing the different types of the records.
Can someone give me some ideas on how to achieve this?
Nandor from the Avro users email list was kind enough to help me out with this answer, credits go to him; this answer is for the record just in case someone else hits the same issue.
His solution is simple, basically using composition rather than inheritance, by introducing a common container class and a field referencing a specific subclass.
With this approach the mapping looks like this:
{
"namespace": "com.foobar",
"name": "UnionRecords",
"type": "array",
"items": {
"type": "record",
"name": "RecordWithCommonFields",
"fields": [
{"name": "commonField1", "type": "string"},
{"name": "commonField2", "type": "string"},
{"name": "subtype", "type": [
{
"type" : "record",
"name": "RecordTypeA",
"fields" : [
{"name": "integerSpecificToA1", "type": ["null", "long"] },
{"name": "stringSpecificToA1", "type": ["null", "string"]}
]
},
{
"type" : "record",
"name": "RecordTypeB",
"fields" : [
{"name": "booleanSpecificToB1", "type": ["null", "boolean"]},
{"name": "stringSpecificToB1", "type": ["null", "string"]}
]
}
]}
]
}
}

How to have nested avro schema for confluent schema registry?

I tried different things with the following web ui
https://schema-registry-ui.landoop.com
I couldn't seem to put the following into the registry:
{
"namespace": "test.avro",
"type": "record",
"name": "test",
"fields": [
{
"name": "field1",
"type": "string"
},
{
"name": "field2",
"type": "record",
"fields":[
{"name": "field1", "type": "string" },
{"name": "field2", "type": "string"},
{"name": "intField", "type": "int"}
]
}
]
}
Also, is there a way to refer to another schema from inside the current one to create a compound/nested schema?
Have a look at the example at
https://github.com/Landoop/schema-registry-ui/issues/43
You need to define the schema as an array, with the nested record as the 1st element and the main Avro record as the 2nd element.
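Separately from the array form described in that issue, plain Avro also lets a field declare its record inline: the field's "type" must then be a complete, named record schema, not the bare string "record". A sketch of the schema from the question in that form follows; the nested record name "nested" is an assumption, since the original gives none.

```json
{
  "namespace": "test.avro",
  "type": "record",
  "name": "test",
  "fields": [
    {"name": "field1", "type": "string"},
    {
      "name": "field2",
      "type": {
        "type": "record",
        "name": "nested",
        "fields": [
          {"name": "field1", "type": "string"},
          {"name": "field2", "type": "string"},
          {"name": "intField", "type": "int"}
        ]
      }
    }
  ]
}
```

Once a named type such as "nested" has been defined, later fields in the same schema can refer to it by name.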

default values in avro schema not written to Java class

I have a simple schema as follows:
{
"name": "owner",
"type": "record",
"doc": "todo",
"fields": [
{ "name": "version", "type": "int", "default": 1},
{ "name": "name", "type": "string" },
{ "name": "age", "type": "int" }
]
}
However, when I use the avro-maven-plugin to generate a Java class from this specification, it does not set the default value of the version field to 1.
How do I make that happen?
Never mind, it works fine as is.
I was looking at the generated Java class, and could not figure out where it was setting the default value to 1. But when I serialize it using json, I see the default value in the output. Also, the getter returns the default value as well.
