How to create a list of records in an Avro schema

I have an Avro schema as shown below.
{"type":"record",
"namespace": "com.test",
"name": "bck",
"fields": [ {"name": "accountid","type": "string"},
{"name":"amendmentpositionid","type": "int"},
{"name":"booking","type":
{"type":"array","items":
{"namespace":"com.test",
"name":"bkkk",
"type":"record",
"fields":
[{"name":"accountid","type":"string"},{"name":"clientid","type":"int"},
{"name":"clientname","type":"string"},{"name":"exerciseid","type":"int"},
{"name":"hedgeid","type":"int"},{"name":"originatingpositionid","type":"int"},
{"name":"positionid","type":"int"},{"name":"relatedpositionid","type":"int"} ]}}}]}
I want to create one more record of the same type as the one above. In other words, I want to create a list of records where the schema of each record is the same as above. How can I achieve this in a single Avro schema file?

The schema you provided already includes an array of records. If my understanding is correct, you want to create another array of records containing this schema, which makes it an array of records within an array of records, in one schema file.
I hope this helps:
{
  "type": "record",
  "namespace": "com.test",
  "name": "list",
  "fields": [
    {"name": "listOfBck", "type": {
      "type": "array",
      "items": {
        "type": "record",
        "namespace": "com.test",
        "name": "bck",
        "fields": [
          {"name": "accountid", "type": "string"},
          {"name": "amendmentpositionid", "type": "int"},
          {"name": "booking", "type": {
            "type": "array",
            "items": {
              "type": "record",
              "namespace": "com.test",
              "name": "bkkk",
              "fields": [
                {"name": "accountid", "type": "string"},
                {"name": "clientid", "type": "int"},
                {"name": "clientname", "type": "string"},
                {"name": "exerciseid", "type": "int"},
                {"name": "hedgeid", "type": "int"},
                {"name": "originatingpositionid", "type": "int"},
                {"name": "positionid", "type": "int"},
                {"name": "relatedpositionid", "type": "int"}
              ]
            }
          }}
        ]
      }
    }}
  ]
}
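If you also need to build such a list of records at runtime, here is a minimal sketch using the Avro generic API. It assumes the wrapper schema above is saved as list.avsc; the file name and the field values are just placeholders.

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class BuildListOfBck {
    public static void main(String[] args) throws Exception {
        // Parse the wrapper schema and get the element schema of the "listOfBck" array.
        Schema listSchema = new Schema.Parser().parse(new File("list.avsc"));
        Schema bckSchema = listSchema.getField("listOfBck").schema().getElementType();

        // Build one "bck" record; the nested "booking" array is left empty here.
        GenericRecord bck = new GenericData.Record(bckSchema);
        bck.put("accountid", "ACC-1");
        bck.put("amendmentpositionid", 1);
        bck.put("booking", new ArrayList<GenericRecord>());

        // The list of records is just a java.util.List assigned to the array field.
        List<GenericRecord> bckList = new ArrayList<>();
        bckList.add(bck);

        GenericRecord wrapper = new GenericData.Record(listSchema);
        wrapper.put("listOfBck", bckList);
        System.out.println(wrapper);
    }
}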

Related

How to extract a nested nullable Avro Schema

The complete schema is the following:
{
"type": "record",
"name": "envelope",
"fields": [
{
"name": "before",
"type": [
"null",
{
"type": "record",
"name": "row",
"fields": [
{
"name": "username",
"type": "string"
},
{
"name": "timestamp",
"type": "long"
}
]
}
]
},
{
"name": "after",
"type": [
"null",
"row"
]
}
]
}
I wanted to programmatically extract the following sub-schema:
{
"type": "record",
"name": "row",
"fields": [
{
"name": "username",
"type": "string"
},
{
"name": "timestamp",
"type": "long"
}
]
}
As you can see, the field "before" is nullable. I can extract its schema by doing:
schema.getField("before").schema()
But that schema is not a record, as it contains null at the beginning (it is a union type), and I can't go inside it to fetch the schema of "row":
["null",{"type":"record","name":"row","fields":[{"name":"username","type":"string"},{"name":"tweet","type":"string"},{"name":"timestamp","type":"long"}]}]
I want to fetch the sub-schema because I want to create a GenericRecord out of it. Basically I want to create two GenericRecords, "before" and "after", and add them to the main GenericRecord created from the full schema.
Any help will be highly appreciated.
Good news: if you have a union schema, you can go inside it to fetch the list of possible options:
Schema unionSchema = schema.getField("before").schema();
List<Schema> unionSchemaContains = unionSchema.getTypes();
At that point, you can look inside the list to find the branch whose type is Schema.Type.RECORD.
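For example, a minimal sketch continuing from the snippet above (GenericData and GenericRecord come from org.apache.avro.generic; the sample values are made up):

// Pick the record branch out of the union (the other branch is just "null").
Schema rowSchema = null;
for (Schema branch : unionSchemaContains) {
    if (branch.getType() == Schema.Type.RECORD) {
        rowSchema = branch;
    }
}

// A GenericRecord can now be built from the extracted sub-schema.
GenericRecord before = new GenericData.Record(rowSchema);
before.put("username", "alice");
before.put("timestamp", 1442531108297L);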

Writing an array of multiple different Records to Avro format, into the same file

We have a legacy file format which I need to migrate to Avro storage. The tricky part is that the records basically have
some common fields,
a discriminator field and
some unique fields, specific to the type selected by the discriminator field
all of them stored in the same file, without any order, fully mixed with each other. (It's legacy...)
In Java/object-oriented terms, one could represent our record concept as follows:
abstract class RecordWithCommonFields {
private Long commonField1;
private String commonField2;
...
}
class RecordTypeA extends RecordWithCommonFields {
private Integer specificToA1;
private String specificToA2;
...
}
class RecordTypeB extends RecordWithCommonFields {
private Boolean specificToB1;
private String specificToB2;
...
}
Imagine the data being something like this:
commonField1Value;commonField2Value,TYPE_IS_A,specificToA1Value,specificToA2Value
commonField1Value;commonField2Value,TYPE_IS_B,specificToB1Value,specificToB2Value
So I would like to process an incoming file and write its content to Avro format, somehow representing the different types of the records.
Can someone give me some ideas on how to achieve this?
Nandor from the Avro users mailing list was kind enough to help me out with this; credit goes to him. This answer is here for the record, in case someone else hits the same issue.
His solution is simple: basically use composition rather than inheritance, by introducing a common container record and a union field that references the specific subtype.
With this approach the mapping looks like this:
{
"namespace": "com.foobar",
"name": "UnionRecords",
"type": "array",
"items": {
"type": "record",
"name": "RecordWithCommonFields",
"fields": [
{"name": "commonField1", "type": "string"},
{"name": "commonField2", "type": "string"},
{"name": "subtype", "type": [
{
"type" : "record",
"name": "RecordTypeA",
"fields" : [
{"name": "integerSpecificToA1", "type": ["null", "long"] },
{"name": "stringSpecificToA1", "type": ["null", "string"]}
]
},
{
"type" : "record",
"name": "RecordTypeB",
"fields" : [
{"name": "booleanSpecificToB1", "type": ["null", "boolean"]},
{"name": "stringSpecificToB1", "type": ["null", "string"]}
]
}
]}
]
}
}
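As a rough sketch of how such data could then be written with the generic API: this assumes the schema above is saved as union_records.avsc, and Avro picks the matching union branch from the record that is passed in. The file names and values here are illustrative.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteUnionRecords {
    public static void main(String[] args) throws Exception {
        // The top-level schema is an array; its element is RecordWithCommonFields.
        Schema arraySchema = new Schema.Parser().parse(new File("union_records.avsc"));
        Schema recordSchema = arraySchema.getElementType();
        Schema typeASchema = recordSchema.getField("subtype").schema().getTypes().get(0);

        GenericRecord typeA = new GenericData.Record(typeASchema);
        typeA.put("integerSpecificToA1", 42L);
        typeA.put("stringSpecificToA1", "specificToA1Value");

        GenericRecord row = new GenericData.Record(recordSchema);
        row.put("commonField1", "commonField1Value");
        row.put("commonField2", "commonField2Value");
        row.put("subtype", typeA);

        // Records with type A and type B subtypes can be appended to the same file in any order.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(recordSchema))) {
            writer.create(recordSchema, new File("mixed_records.avro"));
            writer.append(row);
        }
    }
}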

How to have nested avro schema for confluent schema registry?

I tried different things with the following web UI:
https://schema-registry-ui.landoop.com
I couldn't seem to put the following into the registry:
{
"namespace": "test.avro",
"type": "record",
"name": "test",
"fields": [
{
"name": "field1",
"type": "string"
},
{
"name": "field2",
"type": "record",
"fields":[
{"name": "field1", "type": "string" },
{"name": "field2", "type": "string"},
{"name": "intField", "type": "int"}
]
}
]
}
Also, is there a way to refer to another schema from inside the current one to create a compound/nested schema?
Have a look at the example at
https://github.com/Landoop/schema-registry-ui/issues/43
You need to define the schema as an array, with the nested record as the first element and the main Avro record as the second element.
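For example, a sketch of that array form using the names from the question (the nested record name field2Record is made up here):

[
  {
    "namespace": "test.avro",
    "type": "record",
    "name": "field2Record",
    "fields": [
      {"name": "field1", "type": "string"},
      {"name": "field2", "type": "string"},
      {"name": "intField", "type": "int"}
    ]
  },
  {
    "namespace": "test.avro",
    "type": "record",
    "name": "test",
    "fields": [
      {"name": "field1", "type": "string"},
      {"name": "field2", "type": "test.avro.field2Record"}
    ]
  }
]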

How to mix record with map in Avro?

I'm dealing with server logs which are in JSON format, and I want to store my logs on AWS S3 in Parquet format (and Parquet requires an Avro schema). First, all logs have a common set of fields; second, all logs have a lot of optional fields which are not in the common set.
For example, the following are three logs:
{ "ip": "172.18.80.109", "timestamp": "2015-09-17T23:00:18.313Z", "message":"blahblahblah"}
{ "ip": "172.18.80.112", "timestamp": "2015-09-17T23:00:08.297Z", "message":"blahblahblah", "microseconds": 223}
{ "ip": "172.18.80.113", "timestamp": "2015-09-17T23:00:08.299Z", "message":"blahblahblah", "thread":"http-apr-8080-exec-1147"}
All three logs have 3 shared fields: ip, timestamp and message; some of the logs have additional fields, such as microseconds and thread.
If I use the following schema then I will lose all of the additional fields:
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "String"},
{"name": "message", "type": "string"}
]
}
And the following schema works fine:
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "String"},
{"name": "message", "type": "string"},
{"name": "microseconds", "type": [null,long]},
{"name": "thread", "type": [null,string]}
]
}
But the only problem is that I don't know all the names of the optional fields unless I scan all the logs, and besides, there will be new additional fields in the future.
Then I came up with an idea that combines a record with a map:
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "String"},
{"name": "message", "type": "string"},
{"type": "map", "values": "string"} // error
]
}
Unfortunately this won't compile:
java -jar avro-tools-1.7.7.jar compile schema example.avro .
It throws the following error:
Exception in thread "main" org.apache.avro.SchemaParseException: No field name: {"type":"map","values":"long"}
at org.apache.avro.Schema.getRequiredText(Schema.java:1305)
at org.apache.avro.Schema.parse(Schema.java:1192)
at org.apache.avro.Schema$Parser.parse(Schema.java:965)
at org.apache.avro.Schema$Parser.parse(Schema.java:932)
at org.apache.avro.tool.SpecificCompilerTool.run(SpecificCompilerTool.java:73)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
Is there a way to store JSON strings in Avro format that is flexible enough to deal with unknown optional fields?
Basically this is a schema evolution problem; Spark can deal with it by schema merging, but I'm seeking a solution for Hadoop.
The map type is a "complex" type in Avro terminology. The snippet below works:
{
"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "string"},
{"name": "message", "type": "string"},
{"name": "additional", "type": {"type": "map", "values": "string"}}
]
}
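A minimal sketch of how a log line with extra fields could then be populated with the generic API (assuming the schema above is saved as log.avsc; the file name and values are illustrative):

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class BuildLogRecord {
    public static void main(String[] args) throws Exception {
        Schema logSchema = new Schema.Parser().parse(new File("log.avsc"));

        GenericRecord log = new GenericData.Record(logSchema);
        log.put("ip", "172.18.80.112");
        log.put("timestamp", "2015-09-17T23:00:08.297Z");
        log.put("message", "blahblahblah");

        // Any optional or unknown fields go into the map; keys can differ per record.
        Map<String, String> additional = new HashMap<>();
        additional.put("microseconds", "223");
        log.put("additional", additional);

        System.out.println(log);
    }
}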

Avro schema evolution

I have a question:
Is it possible to use the same reader to parse records that were written with two compatible schemas, e.g. Schema V2 only has an additional optional field compared to Schema V1, and I want the reader to understand both? I think the answer here is no, but if yes, how do I do that?
I have tried writing a record with Schema V1 and reading it with Schema V2, but I get the following error:
org.apache.avro.AvroTypeException: Found foo, expecting foo
I used avro-1.7.3 and:
writer = new GenericDatumWriter<GenericData.Record>(SchemaV1);
reader = new GenericDatumReader<GenericData.Record>(SchemaV2, SchemaV1);
Here are examples of the two schemas (I have tried adding a namespace as well, but no luck).
Schema V1:
{
"name": "foo",
"type": "record",
"fields": [{
"name": "products",
"type": {
"type": "array",
"items": {
"name": "product",
"type": "record",
"fields": [{
"name": "a1",
"type": "string"
}, {
"name": "a2",
"type": {"type": "fixed", "name": "a3", "size": 1}
}, {
"name": "a4",
"type": "int"
}, {
"name": "a5",
"type": "int"
}]
}
}
}]
}
Schema V2:
{
"name": "foo",
"type": "record",
"fields": [{
"name": "products",
"type": {
"type": "array",
"items": {
"name": "product",
"type": "record",
"fields": [{
"name": "a1",
"type": "string"
}, {
"name": "a2",
"type": {"type": "fixed", "name": "a3", "size": 1}
}, {
"name": "a4",
"type": "int"
}, {
"name": "a5",
"type": "int"
}]
}
}
},
{
"name": "purchases",
"type": ["null",{
"type": "array",
"items": {
"name": "purchase",
"type": "record",
"fields": [{
"name": "a1",
"type": "int"
}, {
"name": "a2",
"type": "int"
}]
}
}]
}]
}
Thanks in advance.
I encountered the same issue. That might be a bug in Avro, but you can probably work around it by adding "default": null to the "purchases" field.
Check my blog for details: http://ben-tech.blogspot.com/2013/05/avro-schema-evolution.html
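For reference, with that workaround the new field in Schema V2 would look roughly like this (the same array of purchase records as above, plus the default):

{
  "name": "purchases",
  "type": ["null", {
    "type": "array",
    "items": {
      "name": "purchase",
      "type": "record",
      "fields": [
        {"name": "a1", "type": "int"},
        {"name": "a2", "type": "int"}
      ]
    }
  }],
  "default": null
}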
You can also do the opposite: write data with Schema V2 and read it with Schema V1. At write time all of the fields are written to the file, and a field that the reader does not ask for is simply ignored. But if the writer produced fewer fields than the reader expects, the reader cannot resolve the extra field (unless that field has a default) and will give an error.
The best way is to maintain a schema mapping with a schema registry, such as the Confluent Avro Schema Registry.
Key takeaways:
1. Unlike Thrift, Avro serialized objects do not carry any schema.
2. As there is no schema stored in the serialized byte array, one has to provide the schema with which it was written.
3. Confluent Schema Registry provides a service to maintain schema versions.
4. Confluent provides a cached schema client, which checks its cache first before sending a request over the network.
5. The JSON schema in the "avsc" file is different from the schema embedded in the Avro object.
6. All Avro objects extend GenericRecord.
7. During serialization: based on the schema of the Avro object, a schema ID is requested from the Confluent Schema Registry.
8. The schema ID, which is an integer, is converted to bytes and prepended to the serialized Avro object.
9. During deserialization: the first 4 bytes are removed from the byte array and converted back to an integer (the schema ID).
10. The schema is then requested from the Confluent Schema Registry, and the byte array is deserialized using this schema.
http://bytepadding.com/big-data/spark/avro/avro-serialization-de-serialization-using-confluent-schema-registry/
