I'm using the Avro 1.11.0 library with Python 3.7 to write data to Avro files, and I have a question about Avro's union type. Here are two schemas:
{
"name" : "name",
"type" : ["null", "string"],
"columnName" : "name"
}
{
"name" : "name",
"type" : ["string", "null"],
"columnName" : "name"
}
The first schema declares the union as "type" : ["null", "string"] and the second declares it as "type" : ["string", "null"].
Is there any difference between these two schemas?
The only difference is that the specification states that if you want to use a default value, it should correspond to the first type in the union.
For example, these would be valid:
{
"name" : "name",
"type" : ["null", "string"],
"columnName" : "name",
"default": null
}
{
"name" : "name",
"type" : ["string", "null"],
"columnName" : "name",
"default": "foo"
}
But these would not:
{
"name" : "name",
"type" : ["null", "string"],
"columnName" : "name",
"default": "foo"
}
{
"name" : "name",
"type" : ["string", "null"],
"columnName" : "name",
"default": null
}
Since a union that includes null tends to mean something like an optional field, most people would put null as the first option in the union so that they can set the default value to null.
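The rule above can be checked mechanically. The following is a minimal sketch using only Python's standard library, not the avro package; the helper names `default_matches` and `check_union_default` are made up for this illustration, and only primitive types are handled.

```python
# Hypothetical helpers for illustration; not part of the avro package.

def default_matches(type_name, default):
    """Roughly check that a JSON default value fits a primitive Avro type."""
    is_int = isinstance(default, int) and not isinstance(default, bool)
    if type_name == "null":
        return default is None
    if type_name == "boolean":
        return isinstance(default, bool)
    if type_name in ("int", "long"):
        return is_int
    if type_name in ("float", "double"):
        return is_int or isinstance(default, float)
    if type_name in ("string", "bytes"):
        return isinstance(default, str)
    raise NotImplementedError("only primitive types are handled: %r" % (type_name,))


def check_union_default(field):
    """Return True if the field has no default, or if its default fits the
    first branch of the union (or the plain type, if it is not a union)."""
    if "default" not in field:
        return True
    field_type = field["type"]
    first = field_type[0] if isinstance(field_type, list) else field_type
    return default_matches(first, field["default"])


ok = {"name": "name", "type": ["null", "string"], "default": None}
bad = {"name": "name", "type": ["null", "string"], "default": "foo"}
print(check_union_default(ok), check_union_default(bad))  # True False
```

A real Avro library does its own validation when parsing a schema; this sketch only mirrors the "default matches the first branch" rule described above.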
Related
I am trying to learn Avro and have a question about schemas.
Some documents say
{
"name": "userid",
"type" : "string",
"logicalType" : "uuid"
},
And some say
{
"name": "userid",
"type" : {
"type" : "string",
"logicalType" : "uuid"
}
},
Which one is right? Or are they the same?
Thank you!
I ran variants of your schemas with the avro-tools "random" command (aliased as avro below), which tries to generate a random value for a schema.
A standalone schema that uses the nested type syntax to specify logicalType is rejected:
avro random --schema '{ "name": "userid", "type" : { "type": "string", "logicalType" : "uuid" } }' -
[...] No type: {"name":"userid","type":{"type":"string","logicalType":"uuid"}}
However, it works when putting the logicalType next to type:
avro random --schema ' { "type" : "string", "logicalType" : "uuid" }' -
[...] Objavro.schema{"type":"string","logicalType":"uuid"}avro.codecdeflate [... random bits]
Now, when we use it in a record, we get a warning when putting logicalType next to type:
avro random --schema '{ "type": "record", "fields": [ { "type" : "string", "logicalType" : "uuid", "name": "f"} ] , "name": "rec"}' -
[...] WARN avro.Schema: Ignored the rec.f.logicalType property ("uuid"). It should probably be nested inside the "type" for the field.
Objavro.schema{"type":"record","name":"rec","fields":[{"name":"f","type":"string","logicalType":"uuid"}]}avro.codecdeflate [... random bits]
The nested syntax is accepted without a warning:
avro random --schema '{ "type": "record", "fields": [ { "type" : { "type": "string", "logicalType" : "uuid" } , "name": "f"} ] , "name": "rec"}' -
Objavro.schema{"type":"record","name":"rec","fields":[{"name":"f","type":{"type":"string","logicalType":"uuid"}}]}avro.codecdeflate [... random bits]
Further, if we look at logical types inside arrays, the non-nested syntax is accepted:
avro random --count 1 --schema ' { "type": "array", "items": { "type" : "string", "logicalType" : "uuid" , "name": "f"} , "name": "farr" } ' -
[... random bits]
While the nested version fails:
avro random --count 1 --schema ' { "type": "array", "items": {"type": { "type" : "string", "logicalType" : "uuid" , "name": "f"} } , "name": "farr" } ' -
[...] No type: {"type":{"type":"string","logicalType":"uuid","name":"f"}}
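It appears that when a logical type is used as the type of a field in a record, you need the nested syntax; otherwise you need the non-nested syntax.
This is consistent with logicalType being an attribute of a type's schema rather than of a field: a field's "type" and an array's "items" each hold a schema, and logicalType belongs inside that schema object.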
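To make the field case concrete, here is a small sketch in plain Python (standard library only) that rewrites a record schema so that a logicalType placed next to a field's "type" is moved into a nested type object. The function name `nest_field_logical_types` is hypothetical, and the sketch only handles fields whose type is a plain type name such as "string".

```python
import copy

# Hypothetical helper for illustration; not part of any Avro library.
def nest_field_logical_types(record_schema):
    """Return a copy of a record schema in which a field-level "logicalType"
    is moved into a nested {"type": ..., "logicalType": ...} object."""
    fixed = copy.deepcopy(record_schema)  # leave the caller's schema untouched
    for field in fixed.get("fields", []):
        # Only rewrite the simple case: the type is a bare name like "string".
        if "logicalType" in field and isinstance(field["type"], str):
            logical = field.pop("logicalType")
            field["type"] = {"type": field["type"], "logicalType": logical}
    return fixed


flat = {
    "type": "record",
    "name": "rec",
    "fields": [{"type": "string", "logicalType": "uuid", "name": "f"}],
}
print(nest_field_logical_types(flat)["fields"][0])
```

Logical types with extra attributes (for example decimal with precision and scale) would need those attributes moved as well, which this sketch does not attempt.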
I am trying to create an Avro schema for the JSON below:
{
"id": "TEST",
"status": "status",
"timestamp": "2019-01-01T00:00:22-03:00",
"comment": "add comments or replace it with adSummary data",
"error": {
"code": "ER1212132",
"msg": "error message"
}
}
The error object is optional; it could be
"error" :{}
Below is the Avro schema without a default value:
{
"type" : "record",
"name" : "Order",
"fields" : [ {
"name" : "id",
"type" : "string"
}, {
"name" : "status",
"type" : "string"
}, {
"name" : "timestamp",
"type" : "string"
}, {
"name" : "comment",
"type" : ["null","string"],
"default": null
}, {
"name" : "error",
"type" : {
"type" : "record",
"name" : "error",
"fields" : [ {
"name" : "code",
"type" : "string"
}, {
"name" : "msg",
"type" : "string"
} ]
}
} ]
}
How can I add a default value of {} for the error field?
{
"type" : "record",
"name" : "Order",
"fields" : [ {
"name" : "id",
"type" : "string"
}, {
"name" : "status",
"type" : "string"
}, {
"name" : "timestamp",
"type" : "string"
}, {
"name" : "comment",
"type" : ["null","string"],
"default": null
}, {
"name" : "error",
"type" : [{"type": "record", "fields":[{"name": "code", "type":"string"}, {"name": "msg", "type":"string"}]}, {"type": "record", "fields":[]}]
} ]
}
What is the difference between SerializableCoder and AvroCoder, and when should I use one over the other on a custom data model? From the documentation, it seems that AvroCoder is stricter about the model's schema, while SerializableCoder only needs the model to implement the Serializable interface, which is essentially empty. The documentation for SerializableCoder does warn that it does not guarantee a deterministic encoding. Besides that, in what situation would one choose AvroCoder over SerializableCoder?
The main difference is that AvroCoder uses Avro schemas, so you would typically use AvroCoder when working with .avro files.
An Avro schema is created using JSON format, like this:
{
"type" : "record",
"name" : "userInfo",
"namespace" : "my.example",
"fields" : [{"name" : "username",
"type" : "string",
"default" : "NONE"},
{"name" : "age",
"type" : "int",
"default" : -1},
{"name" : "phone",
"type" : "string",
"default" : "NONE"},
{"name" : "housenum",
"type" : "string",
"default" : "NONE"},
{"name" : "street",
"type" : "string",
"default" : "NONE"},
{"name" : "city",
"type" : "string",
"default" : "NONE"},
{"name" : "state_province",
"type" : "string",
"default" : "NONE"},
{"name" : "country",
"type" : "string",
"default" : "NONE"},
{"name" : "zip",
"type" : "string",
"default" : "NONE"}]
}
On the other hand, SerializableCoder relies on Java serialization: the class only has to implement the Java Serializable interface, which marks it as intended for object serialization but declares no methods.
Additionally, Avro schemas can be read by non-Java applications.
I have a ConvertJsontoAvro processor in NiFi 1.4 and am having difficulty getting the proper decimal datatype in the Avro output. The data is read by an ExecuteSQL processor, which turns decimals into bytes using Avro logical types. The Avro is then converted to JSON with a ConvertAvrotoJSON processor, and a ConvertJsonToAvro processor converts it back so that it can be written to HDFS using PutParquet.
My schema is :
{
"type" : "record",
"name" : "schema",
"fields" : [ {
"name" : "entryDate",
"type" : [ "null", {
"type" : "long",
"logicalType" : "timestamp-micros"
} ],
"default" : null
}, {
"name" : "points",
"type" : [ "null", {
"type" : "bytes",
"logicalType" : "decimal",
"precision" : 18,
"scale" : 6
} ],
"default" : null
}]
}
My JSON:
{
"entryDate" : 2018-01-26T13:48:22.087,
"points" : 6.000000
}
I get the following error for the Avro conversion:
Cannot convert field points: Cannot resolve union : {"bytes": "+|Ð" not in ["null", {"type":"bytes","logicalType":"decimal","precision":18,"scale":6}]"
Is there a workaround for this?
Currently you cannot mix the null type and logical types in a union because of a bug in Avro. See this still-unresolved issue:
https://issues.apache.org/jira/browse/AVRO-1891
Also, the default value cannot be null. This should work for you:
{
"type" : "record",
"name" : "schema",
"fields" : [ {
"name" : "entryDate",
"type" : {
"type" : "long",
"logicalType" : "timestamp-micros"
},
"default" : 0
}, {
"name" : "points",
"type" : {
"type" : "bytes",
"logicalType" : "decimal",
"precision" : 18,
"scale" : 6
},
"default" : ""
}]
}
For anyone interested, I was able to declare the decimal with a default value of null (for cases where the field is null or missing), currently using NiFi 1.14.0:
{
"name": "value",
"type": [
"null",
{
"type": "bytes",
"logicalType": "decimal",
"precision": 8,
"scale": 4
}
],
"default": null
}
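For background on why the decimal shows up as raw bytes in the error above: the Avro decimal logical type stores the unscaled integer value as big-endian two's-complement bytes. The sketch below illustrates that encoding with Python's standard library only; the function names are hypothetical, and it does not check the value against the schema's precision.

```python
from decimal import Decimal

# Hypothetical helpers for illustration; not part of NiFi or the avro package.

def decimal_to_avro_bytes(value, scale):
    """Encode a Decimal as the big-endian two's-complement bytes of its
    unscaled integer (value * 10**scale)."""
    shifted = value.scaleb(scale)  # move the decimal point right by `scale` digits
    if shifted != shifted.to_integral_value():
        # Refuse values that would need rounding rather than truncating silently.
        raise ValueError("value has more than %d fractional digits" % scale)
    unscaled = int(shifted)
    # Enough bytes to hold the value plus its sign bit.
    length = (unscaled.bit_length() + 8) // 8
    return unscaled.to_bytes(length, byteorder="big", signed=True)


def avro_bytes_to_decimal(raw, scale):
    """Decode big-endian two's-complement bytes back into a Decimal."""
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)


encoded = decimal_to_avro_bytes(Decimal("6.000000"), 6)
print(encoded, avro_bytes_to_decimal(encoded, 6))
```

With scale 6, the value 6.000000 has the unscaled integer 6000000, which is what ends up in the bytes field.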
Suppose I’ve got the following schema:
{
"name" : "Profile",
"type" : "record",
"fields" : [
{ "name" : "firstName", "type" : "string" },
{ "name" : "address" , "type" : {
"type" : "record",
"name" : "AddressUSRecord",
"fields" : [
{ "name" : "address1" , "type" : "string" },
{ "name" : "address2" , "type" : "string" },
{ "name" : "city" , "type" : "string" },
{ "name" : "state" , "type" : "string" },
{ "name" : "zip" , "type" : "int" },
{ "name" : "zip4", "type": "int" }
]
}
}
]
}
I’m using a GenericRecord to represent each Profile that gets created. To add a firstName, it’s easy to do the following:
Schema sch = Schema.parse(schemaFile);
DataFileWriter<GenericRecord> fw = new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>()).create(sch, new File(outFile));
GenericRecord r = new GenericData.Record(sch);
r.put("firstName", "John");
fw.append(r);
But how would I set the city, for example? How do I represent the key as a string that the r.put method can understand?
Thanks
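For the schema above, create a second GenericRecord from the address field's schema, fill it in, and put it into the parent record:
// Build the nested record from the schema of the "address" field.
GenericRecord t = new GenericData.Record(sch.getField("address").schema());
t.put("city", "beijing");
// Attach the nested record to the parent under the "address" key.
r.put("address", t);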