How to have nested avro schema for confluent schema registry? - avro

I tried different things with the following web ui
https://schema-registry-ui.landoop.com
I couldn't seem to put the following into the registry:
{
"namespace": "test.avro",
"type": "record",
"name": "test",
"fields": [
{
"name": "field1",
"type": "string"
},
{
"name": "field2",
"type": "record",
"fields":[
{"name": "field1", "type": "string" },
{"name": "field2", "type": "string"},
{"name": "intField", "type": "int"}
]
}
]
}
Also, is there a way to refer to another schema from inside the current one to create a compound/nested schema?

Have a look at the example at
https://github.com/Landoop/schema-registry-ui/issues/43
You need to define schema as an array - with the 1st element the nested record
and as a 2nd element the main avro record

Related

How to extract a nested nullable Avro Schema

The complete schema is the following:
{
"type": "record",
"name": "envelope",
"fields": [
{
"name": "before",
"type": [
"null",
{
"type": "record",
"name": "row",
"fields": [
{
"name": "username",
"type": "string"
},
{
"name": "timestamp",
"type": "long"
}
]
}
]
},
{
"name": "after",
"type": [
"null",
"row"
]
}
]
}
I wanted to programmatically extract the following sub-schema:
{
"type": "record",
"name": "row",
"fields": [
{
"name": "username",
"type": "string"
},
{
"name": "timestamp",
"type": "long"
}
]
}
As you see, field "before" is nullable. I can extract it's schema by doing:
schema.getField("before").schema()
But the schema is not a record as it contains NULL at the beginning(UNION type) and I can't go inside to fetch schema of "row".
["null",{"type":"record","name":"row","fields":[{"name":"username","type":"string"},{"name":"tweet","type":"string"},{"name":"timestamp","type":"long"}]}]
I want to fetch the sub-schema because I want to create GenericRecord out of it. Basically I want to create two GenericRecords "before" and "after" and add them to the main GenericRecord created from full schema.
Any help will be highly appreciated.
Good news, if you have a union schema, you can go inside to fetch the list of possible options:
Schema unionSchema = schema.getField("before").schema();
List<Schema> unionSchemaContains = unionSchema.getTypes();
At that point, you can look inside the list to find the one that corresponds to the Type.RECORD.

How to Create List of records in Avro Schema

I have a Avro Schema as mentioned below.
{"type":"record",
"namespace": "com.test",
"name": "bck",
"fields": [ {"name": "accountid","type": "string"},
{"name":"amendmentpositionid","type": "int"},
{"name":"booking","type":
{"type":"array","items":
{"namespace":"com.test",
"name":"bkkk",
"type":"record",
"fields":
[{"name":"accountid","type":"string"},{"name":"clientid","type":"int"},
{"name":"clientname","type":"string"},{"name":"exerciseid","type":"int"},
{"name":"hedgeid","type":"int"},{"name":"originatingpositionid","type":"int"},
{"name":"positionid","type":"int"},{"name":"relatedpositionid","type":"int"} ]}}}]}
I want to create one more record of same type as mentioned above. OR I mean to say that I want to create list of records where schema of each record is same as above. How can i achieve it in single Avro file schema?
The schema you provided already include an array of records. If my understanding is correct, you want to create another array of records using/containing this schema, which makes it an array of records within an array of records, in one schema file.
I hope this helps.
{
"type": "record",
"namespace": "com.test",
"name": "list",
"fields": [{
"name":"listOfBck","type":
{"type":"array","items":
{
"type": "record",
"namespace": "com.test",
"name": "bck",
"fields": [
{"name": "accountid","type": "string"},
{"name":"amendmentpositionid","type": "int"},
{"name":"booking","type":
{"type":"array","items":
{"namespace":"com.test",
"name":"bkkk",
"type":"record",
"fields": [
{"name":"accountid","type":"string"},{"name":"clientid","type":"int"},
{"name":"clientname","type":"string"},{"name":"exerciseid","type":"int"},
{"name":"hedgeid","type":"int"},{"name":"originatingpositionid","type":"int"},
{"name":"positionid","type":"int"},{"name":"relatedpositionid","type":"int"}
]
}
}
}
]
}
}
}]
}

How to mix record with map in Avro?

I'm dealing with server logs which are JSON format, and I want to store my logs on AWS S3 in Parquet format(and Parquet requires an Avro schema). First, all logs have a common set of fields, second, all logs have a lot of optional fields which are not in the common set.
For example, the follwoing are three logs:
{ "ip": "172.18.80.109", "timestamp": "2015-09-17T23:00:18.313Z", "message":"blahblahblah"}
{ "ip": "172.18.80.112", "timestamp": "2015-09-17T23:00:08.297Z", "message":"blahblahblah", "microseconds": 223}
{ "ip": "172.18.80.113", "timestamp": "2015-09-17T23:00:08.299Z", "message":"blahblahblah", "thread":"http-apr-8080-exec-1147"}
All of the three logs have 3 shared fields: ip, timestamp and message, some of the logs have additional fields, such as microseconds and thread.
If I use the following schema then I will lose all additional fields.:
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "String"},
{"name": "message", "type": "string"}
]
}
And the following schema works fine:
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "String"},
{"name": "message", "type": "string"},
{"name": "microseconds", "type": [null,long]},
{"name": "thread", "type": [null,string]}
]
}
But the only problem is that I don't know all the names of optional fields unless I scan all the logs, besides, there will new additional fields in future.
Then I think out an idea that combines record and map:
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "String"},
{"name": "message", "type": "string"},
{"type": "map", "values": "string"} // error
]
}
Unfortunately this won't compile:
java -jar avro-tools-1.7.7.jar compile schema example.avro .
It will throw out an error:
Exception in thread "main" org.apache.avro.SchemaParseException: No field name: {"type":"map","values":"long"}
at org.apache.avro.Schema.getRequiredText(Schema.java:1305)
at org.apache.avro.Schema.parse(Schema.java:1192)
at org.apache.avro.Schema$Parser.parse(Schema.java:965)
at org.apache.avro.Schema$Parser.parse(Schema.java:932)
at org.apache.avro.tool.SpecificCompilerTool.run(SpecificCompilerTool.java:73)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
Is there a way to store JSON strings in Avro format which are flexible to deal with unknown optional fields?
Basically this is a schema evolution problem, Spark can deal with this problem by Schema Merging. I'm seeking a solution with Hadoop.
The map type is a "complex" type in avro terminology. The below snippet works:
{
"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "string"},
{"name": "message", "type": "string"},
{"name": "additional", "type": {"type": "map", "values": "string"}}
]
}

Avro-Tools JSON to Avro Schema fails: org.apache.avro.SchemaParseException: Undefined name:

I am trying to create two Avro schemas using the avro-tools-1.7.4.jar create schema command.
I have two JSON schemas which look like this:
{
"name": "TestAvro",
"type": "record",
"namespace": "com.avro.test",
"fields": [
{"name": "first", "type": "string"},
{"name": "last", "type": "string"},
{"name": "amount", "type": "double"}
]
}
{
"name": "TestArrayAvro",
"type": "record",
"namespace": "com.avro.test",
"fields": [
{"name": "date", "type": "string"},
{"name": "records", "type":
{"type":"array","items":"com.avro.test.TestAvro"}}
]
}
When I run the create schema on these two files the first one works fine and generates the java. The second one fails every time. It does not like the array items when I try and use the first Schema as the type. This is the error I get:
Exception in thread "main" org.apache.avro.SchemaParseException: Undefined name: "com.test.avro.TestAvro"
at org.apache.avro.Schema.parse(Schema.java:1052)
Both files are located in the same path directory.
Use the below avsc file:
[{
"name": "TestAvro",
"type": "record",
"namespace": "com.avro.test",
"fields": [
{
"name": "first",
"type": "string"
},
{
"name": "last",
"type": "string"
},
{
"name": "amount",
"type": "double"
}
]
},
{
"name": "TestArrayAvro",
"type": "record",
"namespace": "com.avro.test",
"fields": [
{
"name": "date",
"type": "string"
},
{
"name": "records",
"type": {
"type": "array",
"items": "com.avro.test.TestAvro"
}
}
]
}]

Avro schema evolution

I have two questions:
Is it possible to use the same reader and parse records that were written with two schemas that are compatible, e.g. Schema V2 only has an additional optional field compared to Schema V1 and I want the reader to understand both? I think the answer here is no, but if yes, how do I do that?
I have tried writing a record with Schema V1 and reading it with Schema V2, but I get the following error:
org.apache.avro.AvroTypeException: Found foo, expecting foo
I used avro-1.7.3 and:
writer = new GenericDatumWriter<GenericData.Record>(SchemaV1);
reader = new GenericDatumReader<GenericData.Record>(SchemaV2, SchemaV1);
Here are examples of the two schemas (I have tried adding a namespace as well, but no luck).
Schema V1:
{
"name": "foo",
"type": "record",
"fields": [{
"name": "products",
"type": {
"type": "array",
"items": {
"name": "product",
"type": "record",
"fields": [{
"name": "a1",
"type": "string"
}, {
"name": "a2",
"type": {"type": "fixed", "name": "a3", "size": 1}
}, {
"name": "a4",
"type": "int"
}, {
"name": "a5",
"type": "int"
}]
}
}
}]
}
Schema V2:
{
"name": "foo",
"type": "record",
"fields": [{
"name": "products",
"type": {
"type": "array",
"items": {
"name": "product",
"type": "record",
"fields": [{
"name": "a1",
"type": "string"
}, {
"name": "a2",
"type": {"type": "fixed", "name": "a3", "size": 1}
}, {
"name": "a4",
"type": "int"
}, {
"name": "a5",
"type": "int"
}]
}
}
},
{
"name": "purchases",
"type": ["null",{
"type": "array",
"items": {
"name": "purchase",
"type": "record",
"fields": [{
"name": "a1",
"type": "int"
}, {
"name": "a2",
"type": "int"
}]
}
}]
}]
}
Thanks in advance.
I encountered the same issue. That might be a bug of avro, but you probably can work around by adding "default": null to the field of "purchase".
Check my blog for details: http://ben-tech.blogspot.com/2013/05/avro-schema-evolution.html
You can do opposite of it . Mean you can parse data schem 1 and write data from schema 2 . Beacause at write time it write data into file and if we don't provide any field at reading time than it will be ok. But if we write less field than read than it will not recognize extra field at reading time so , it will give error .
Best way is to have a schema mapping to maintain the schema like Confluent Avro schema registry.
Key Take Aways:
1. Unlike Thrift, avro serialized objects do not hold any schema.
2. As there is no schema stored in the serialized byte array, one has to provide the schema with which it was written.
3. Confluent Schema Registry provides a service to maintain schema versions.
4. Confluent provides Cached Schema Client, which checks in cache first before sending the request over the network.
5. Json Schema present in “avsc” file is different from the schema present in Avro Object.
6. All Avro objects extends from Generic Record
7. During Serialization : based on schema of the Avro Object a schema Id is requested from the Confluent Schema Registry.
8. The schemaId which is a INTEGER is converted to Bytes and prepend to serialized AvroObject.
9. During Deserialization : First 4 bytes are removed from the ByteArray. 4 bytes are converted back to INTEGER(SchemaId)
10. Schema is requested from the Confluent Schema Registry and using this schema the byteArray is deserialized.
http://bytepadding.com/big-data/spark/avro/avro-serialization-de-serialization-using-confluent-schema-registry/

Resources